Speech Synthesis as a Tool for the Study of Speech Production*
Franklin  S.  Cooper,  Paul  Mermelstein  and  Patrick  W.  Nye 


ABSTRACT 


Those  elements  of  articulation  that  are  essential  to  the 
communicative  role  of  speech  can  be  studied  in  ways  that  have  proved 
successful  in  discovering  the  acoustic  cues  for  speech  perception. 

The  method  proposed  is  to  synthesize  speech  with  an  appropriate 
articulatory  model  and  evaluate  the  output  signal  by  ear.  In  this 
way,  hypotheses  about  articulatory  cues  for  speech  perception  can  be 
tested  directly  in  terms  of  the  intelligibility  of  the  synthetic 
speech.

Control  of  the  synthesis — by  hand  or  by  rule — will  be  in  terms 
of  movements  of  the  model's  articulators,  but  with  access  available 
also  to  intermediate  stages  for  manipulation  of  vocal  tract  shape  or 
acoustic  spectrum.  A primary  consideration  is  to  make  the  control 
of  synthesis  highly  interactive;  that  is,  displays  and  controls  will 
be  conceptually  convenient  and  easy  to  operate,  and  the  synthetic 
speech  will  be  available  for  listening  very  soon  after  changes  are 
made  in  the  control  parameters.  Quality  and  naturalness  of  the 
synthetic speech are secondary considerations, since the main objective
is good intelligibility with minimal articulatory descriptions.

Our  intent  in  this  paper  is  to  describe  some  research  studies  that  we  are 
undertaking  and  to  explain  our  reasons  for  choosing  speech  synthesis  and  the 
class  of  research  questions  that  synthesis  as  a methodology  implies.  Briefly, 
we  wish  to  learn  what  parts  of  the  complex  articulatory  events  of  speech 
production  are  actually  carrying  the  message,  that  is,  what  articulatory  cues 
the  speaker  must  produce  in  order  that  the  listener  will  understand  what  was 
said.  We  think  of  this  as  a search  for  the  articulatory  cues  that  parallels 
earlier  work  we  have  done  on  searching  for  the  acoustic  cues  in  speech. 

There  are  close  parallels  between  the  two  kinds  of  search,  and  we  have 
found  it  useful  in  planning  the  work  on  articulatory  cues  to  draw  analogies 
with  our  experience  in  searching  for  acoustic  cues.  Hence,  we  will  speak  of 
that  experience  in  presenting  our  plans.  We  will  even  digress  into  a brief 
description  of  a new  pattern  playback  we  have  built;  it  will  be  useful  in  the 
planned  studies  even  though  it  was  designed  primarily  for  research  in  acoustic 
cues.


*This paper will appear in the U.S.-Japan Joint Seminar on Dynamic Aspects of
Speech Production, ed. by M. Sawashima and F. S. Cooper, Univ. of Tokyo
Press.

We have usually spoken of speech synthesis as a tool for the study of
speech  perception.  But  the  acoustic  cues  we  found  all  seemed  to  point  back  to 
articulation,  implying  that  we  were,  in  fact,  studying  production  by  way  of 
perception.  Thus,  the  parallels  between  our  earlier  work  and  the  planned  work 
can  be  viewed  in  this  way:  both  were  concerned  with  speech  production,  though 
the  earlier  work  was  on  cues  at  the  acoustic  level,  whereas  the  planned  work 
is  on  cues  at  the  articulatory  level. 

In either case, the distinguishing characteristics of the methodology are
that  it  seeks  to  find  the  principal  carriers  of  information,  that  it  tests  for 
these  cues  by  perceptual  methods,  and  that  it  uses  synthetic  speech  to  do  so. 
Obviously,  speech  is  the  required  stimulus  when  the  perception  of  a message  is 
to  be  tested,  and  synthetic  speech  has  the  very  great  advantage  that 
systematic  manipulation  of  the  stimuli  is  possible,  either  at  the  acoustic 
level  or  at  the  articulatory  level  that  precedes  it. 

Research Methods: From Acoustic Cues to Articulatory Cues

The method we used in searching for the acoustic cues, often called
"hypothesize-and-test", proved well suited to that task (Liberman and Cooper,
1972).  We  think  it  will  be  equally  effective  in  the  search  for  the 
articulatory  cues.  The  earlier  work  was,  in  fact,  modeled  on  the  chemist's 
customary  technique  of  testing  his  analytic  conclusions  by  synthesizing  the 
suspected  compound  and  comparing  properties.  We  started  with  the  patterns  we 
thought  we  could  see  in  sound  spectrograms  and  regenerated  sound  from  such 
patterns  with  a device  we  built  for  that  purpose,  namely,  the  Pattern 
Playback.  In  using  it,  a speaker  produces  an  utterance  from  which  the 
experimenter  prepares  a spectrogram.  Guided  by  this  spectrogram,  a schematic 
copy  is  painted  and  passed  to  the  Pattern  Playback  for  conversion  into 
synthetic  speech.  Now  the  two  speech  samples,  the  natural  and  the  synthetic, 
are compared to determine by ear whether the essential acoustic cues have
survived  in  the  painted  copy.  The  procedure  is  highly  interactive.  The  user 
is  given  the  opportunity  to  rapidly  insert  or  delete  spectral  features  at  will 
and  to  immediately  assess  their  importance  by  listening  to  the  synthetic 
output  and  comparing  it  with  the  natural  speech  sample. 

The  principal  ways  in  which  we  propose  to  model  our  new  procedures  on  the 
old are by providing the means to obtain results quickly, to make modifications
to the data interactively by hand, and to compare the outputs at a
variety  of  different  levels,  but  especially  at  the  perceptual  level.  The 
organization  of  the  research  method  is  illustrated  in  Figure  1,  which  shows 
three  ways  to  experiment  on  speech  that  is  generated  by  a real  speaker,  an 
articulatory  model,  or  a terminal  analog  speech  synthesizer. 

For  articulatory  synthesis,  we  may  compare  the  articulatory  control  data 
for  a particular  articulation  with  EMG  data  from  our  physiological  research, 
especially  as  to  relative  timing  of  events.  Likewise,  we  may  compare,  almost 
directly,  the  vocal  tract  shape  for  articulatory  synthesis  with  X-ray  and 
fiberoptic  data  measured  from  an  actual  vocal  tract.  Differences  in  the 
moment-by-moment vocal tract configurations will indicate where improvements
might be made in the synthesis. When rules have been used to compute the
control signals and vocal tract configurations, means will also be available
to override these controls and to make changes in the vocal tract shape
directly by hand. This facility will be useful in a number of experimental
situations where it is desirable to examine the acoustic effects of
individually specified articulatory movements.

[Figure 1 caption, beginning not recovered: ... articulatory model can be used
to generate synthetic speech and thereby test hypotheses about articulatory
cues for speech perception; (3) likewise, a terminal analog synthesizer can
be used to test hypotheses embedded in rules for synthesis. The key operation
in all three kinds of experimentation is perceptual evaluation of the
synthetic speech, coupled with visual comparison of the spectra of synthetic
and real speech.]

As  a final  step  in  the  above  procedures,  the  output  signal  is  presented 
to  listeners,  who  are  asked  to  make  relative  judgments  about  the  speech,  or 
absolute judgments about its intelligibility or adequacy. Exploratory manipulations
and informal listenings will usually be followed by formal group
tests.

Modeling  the  Speech  Process 

In  representing  the  speech  process  by  a model  (or  synthesizer)  and  in 
manipulating  it  with  control  parameters  that  specify  the  phonetic  elements  of 
the  message,  the  choice  of  level  of  representation  is  crucial.  Moreover,  that 
choice  hinges  on  a number  of  considerations:  intended  use,  feasibility, 
conceptual  convenience,  and  available  knowledge  are  the  primary  desiderata. 

If  we  consider  human  speech  production,  we  find  three  distinct  levels  of 
the  articulatory  process  that  lie  downstream  from  the  presumed  neural  levels 
(to  which  we  have  little  or  no  direct  experimental  access): 

1.  The  activity  of  the  individual  muscles  (in  response  to  neuromotor 
commands).

2. The positions of the articulators and their movement in response to
muscle  activity. 

3.  The  corresponding  vocal-tract  shape  in  terms  of  the  cross-sectional 
area  function  of  the  vocal  tract. 

For  the  research  purposes  we  have  in  mind,  namely  an  exploratory  search 
for  the  articulatory  cues,  the  third  and  lowest  articulatory  level  is  not  very 
useful  since,  at  the  level  of  vocal-tract  area  functions,  the  conceptually 
important  entities — the  positions  and  movements  of  individual  articulators — 
have  already  been  merged  into  a single  continuum.  We  will  certainly  wish  to 
observe  the  performance  of  the  model  at  this  level,  and  even  to  exercise 
supervisory  control  over  the  area  functions,  but  primary  conceptualization  and 
control  can  be  done  to  better  advantage  at  the  next  higher  level,  that  is,  by 
manipulating  the  articulators  themselves. 

Would  we  gain  by  working  at  the  still  higher  levels  of  muscle  activity  or 
of the neuromotor commands that control the muscles? The philosophical
question  of  where  maximum  simplicity  is  eventually  to  be  found  has  yet  to  be 
convincingly  answered.  For  the  present,  then,  we  rely  on  the  practical 
considerations that our knowledge (from articulatory phonetics and cinefluorography)
is better at level two than at level one, and that starting higher
in  the  speech  process  would  require  more  parameters  and  an  additional  stage  of 
computation  (to  reach  level  two)  without  compensating  advantages  other  than 
that  electromyographic  information  could  be  applied  more  directly.  For  all 
these  reasons,  we  intend  to  concentrate  on  the  representation  of  phones  and 
features  at  the  second  of  the  levels  listed  above  and  on  transformations  from 
that  level  to  the  speech  signal. 

Design Considerations for an Articulatory Synthesizer

We  intend  to  start  our  research  using  an  articulatory  model  developed  by 
Mermelstein  (1973)  that  allows  parametric  specification  in  the  midsagittal 
plane  for  the  position  of  the  lips,  tongue  tip,  tongue  body,  velum,  jaw,  and 
hyoid;  we  will  extend  the  model  by  the  addition  of  a variable  that  produces 
concave/convex  arching  of  the  tongue  blade.  These  parameters  permit  the 
computation  of  vocal-tract  transfer  functions  for  laryngeal  excitation  or  for 
fricative  excitation  at  points  internal  to  the  tract  (Mermelstein,  1972).  The 
model has already demonstrated a capability for matching vocal tract configurations
seen in X-ray movies and for generating highly intelligible VCV
syllables.

The  model  does  not  simulate  the  entire  speech-production  system  in  man. 
In particular, it separates control of the sources of excitation of the
vocal-tract resonances from control of the changes in those resonances with time.
Since  many  aspects  of  coarticulation  depend  only  on  the  interaction  of  the 
supraglottal  articulators,  only  the  positions  of  these  articulators  are 
computed, starting from a phonetic specification. Laryngeal excitation parameters
(amplitude, fundamental frequency, onset, and duration) are specified
explicitly, and effects of the supraglottal system back on the excitation
source are neglected. Similarly, for frication, the amplitude and spectrum of
the noise are explicitly specified; the output spectrum will, of course,
reflect  not  only  the  source  spectrum  but  also  the  filtering  action  of  the 
vocal-tract  cavities  posterior  and  anterior  to  an  assumed  frication  source  at 
the  point  of  maximum  constriction  along  the  tract. 

The  prime  reason  for  not  modeling  directly  the  effects  of  articulator 
movement  on  the  characteristics  of  the  sound  generation  process  is  to  limit 
the  complexity  of  the  simulation.  Thus,  we  do  not  for  the  present  intend  to 
model  laryngeal  action  because  we  do  not  think  that  it  plays  a central  role  in 
the coarticulation processes that we plan to study first. An exception may be
the  relative  timing  of  laryngeal  and  supralaryngeal  events,  but  this  does  not 
require  detailed  simulation  of  laryngeal  mechanisms.  Similarly,  although  the 
generation  of  frication  is  directly  dependent  on  appropriate  articulatory 
conditions,  its  accurate  modeling  requires  very  precise  timing  and  positioning 
of  the  articulators.  For  these  reasons,  we  rely  on  explicit  control  over  the 
excitation  signal  itself  rather  than  over  the  generative  processes.  The 
perceptual  effects  of  simultaneous  excitatory  and  articulatory  variations  can 
still be evaluated quite adequately, despite these substitutions for aerodynamic
effects that link excitation to articulation in real speech.

The considerations that led to modeling articulatory movements exclusively
in terms of the resulting vocal-tract shape in the midsagittal plane were
primarily based on observational limitations. That is, the model was originally
developed on the basis of a systematic examination of a series of
midsagittal X-ray tracings of the vocal tract, in conjunction with the
time-synchronized speech signal. By working interactively with the model during
its  development,  it  could  be  shown  that  displacements  of  the  midsagittal 
vocal-tract  outline  can  be  derived  from  the  movements  of  the  independently 
controlled  articulators. 

Primary control of the model in terms of positions and movements of the
principal  articulators  is,  of  course,  an  essential  design  consideration:  this 
mode  of  control  is  conceptually  convenient  for  the  experimenter  and  is  the 
natural  framework  for  the  application  of  structural  and  dynamic  constraints. 
The  articulatory  model  we  will  be  using  builds  on  the  "ball-in-mouth"  model  of 
the  articulators  that  was  introduced  by  Coker  and  Fujimura  (1966),  but  uses  a 
more  nearly  complete  set  of  articulators.  The  parameters  assigned  to  these 
articulators are position variables that indicate the position of the structure
in fixed space or relative to some other articulatory structure to which
the  articulator  is  primarily  attached.  For  example,  lip  and  tongue-body 
positions  are  specified  with  respect  to  the  moving  jaw.  This  representation 
allows  an  active  mode  of  movement  when  an  articulator's  own  parameters  are 
changing;  alternatively,  a passive  mode  of  movement  may  be  executed  relative 
to  the  fixed  articulators  as  a result  of  movement  of  the  structure  to  which 
that articulator is attached, but relative to which its position remains
unchanged.
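
The distinction between active and passive movement can be illustrated with a small sketch. It assumes, purely for illustration, that the jaw rotates about a fixed pivot and that a lip point is stored in jaw-relative coordinates. The pivot location, the angles, and the names `Jaw` and `to_fixed_space` are hypothetical and are not taken from the model described here.

```python
import math
from dataclasses import dataclass


@dataclass
class Jaw:
    pivot: tuple  # fixed-space pivot (x, y) in cm; hypothetical value
    angle: float  # rotation about the pivot, in radians


def to_fixed_space(jaw, local_xy):
    """Map a point stored relative to the jaw into fixed-space coordinates."""
    x, y = local_xy
    c, s = math.cos(jaw.angle), math.sin(jaw.angle)
    return (jaw.pivot[0] + c * x - s * y, jaw.pivot[1] + s * x + c * y)


jaw = Jaw(pivot=(0.0, 0.0), angle=0.0)
lip_local = (8.0, 0.0)  # lip position relative to the jaw (assumed)

rest = to_fixed_space(jaw, lip_local)

# Passive movement: the jaw rotates, the lip's own parameters stay unchanged.
jaw.angle = -0.1
passive = to_fixed_space(jaw, lip_local)

# Active movement: the lip's own parameters change while the jaw stays put.
lip_local = (8.0, 0.5)
active = to_fixed_space(jaw, lip_local)
```

In both cases the lip ends up at a new place in fixed space, but only in the active case has a parameter belonging to the lip itself been altered.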

The  model  first  computes  the  midsagittal  outlines  that  result  from  the 
momentary  positions  of  the  articulators  and  then  computes  the  midsagittal 
separations  relative  to  an  essentially  fixed  outer  structure.  Published 
information  is  used  to  convert  these  distances,  measured  at  a large  number  of 
points along the vocal tract, to a continuous cross-sectional area function
along a center-line distance function between the glottis and the lips. Up to
25 area samples spaced 0.875 cm apart are now computed and used in a
nonuniform acoustic transmission line representation. Appropriate lumped
terminations model the larynx at one end of the tract and the lips at the
other.  Articulations  accompanied  by  a velar  opening  are  modeled  with  the  aid 
of  an  acoustic  sidebranch  that  parallels  the  nonuniform  transmission  line  for 
the  oral  tract.  The  cross-section  of  this  nasal  branch  is  assumed  to  be  fixed 
except  for  a region  near  the  velum. 
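
As a rough illustration of how resonances follow from an area function, the sketch below treats each 0.875 cm sample as a lossless cylindrical section and multiplies the sections' chain matrices from glottis to lips. It is a simplification and not the model's own computation: the lumped terminations are replaced by an ideal closed glottis and an ideal open lip end, losses and the nasal branch are ignored, and the sound speed is an assumed round value. The function names `chain_d` and `formants` are hypothetical.

```python
import math

SPEED_OF_SOUND = 35000.0  # cm/s; assumed round value, not from the text
SECTION_LENGTH = 0.875    # cm; the sample spacing quoted in the text


def chain_d(areas, freq):
    """D element of the glottis-to-lips chain matrix at one frequency.

    Each lossless section contributes [[cos, j*sin/A], [j*A*sin, cos]]
    (rho*c normalised to 1). With zero pressure at the lips, the volume
    velocity transfer function is 1/D, so resonances are the zeros of D.
    """
    kl = 2.0 * math.pi * freq * SECTION_LENGTH / SPEED_OF_SOUND
    cos_kl, sin_kl = math.cos(kl), math.sin(kl)
    # Running product stored as [[a, j*b], [j*c, d]] with a, b, c, d real.
    a, b, c, d = 1.0, 0.0, 0.0, 1.0
    for area in areas:  # areas ordered from glottis to lips, in cm^2
        sb = sin_kl / area
        sc = sin_kl * area
        a, b, c, d = (a * cos_kl - b * sc,
                      a * sb + b * cos_kl,
                      c * cos_kl + d * sc,
                      d * cos_kl - c * sb)
    return d


def formants(areas, f_max=5000.0, step=10.0):
    """Resonance frequencies (Hz) found by scanning D for sign changes."""
    found = []
    f_lo, d_lo = step, chain_d(areas, step)
    for i in range(2, int(f_max / step) + 1):
        f_hi = i * step
        d_hi = chain_d(areas, f_hi)
        if d_lo != 0.0 and d_lo * d_hi <= 0.0:
            lo, hi = f_lo, f_hi
            for _ in range(40):  # bisection refinement
                mid = 0.5 * (lo + hi)
                if d_lo * chain_d(areas, mid) <= 0.0:
                    hi = mid
                else:
                    lo = mid
            found.append(0.5 * (lo + hi))
        f_lo, d_lo = f_hi, d_hi
    return found


uniform_tract = [3.0] * 20  # 20 sections = 17.5 cm of constant area
resonances = formants(uniform_tract)
```

For the uniform 17.5 cm tube the resonances should fall close to the odd multiples of c/4L, that is, near 500, 1500, and 2500 Hz for the assumed sound speed, which gives a quick sanity check before trying non-uniform shapes.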

The Control and Display of Articulatory Synthesis

The  articulatory  process  will,  of  course,  be  simulated  on  a computer, 
since digital simulation provides flexibility and convenience that are not
attainable  through  the  use  of  physical  models.  For  the  model  to  be  a truly 
useful  tool,  it  must  be  equipped  with  displays  that  allow  observation  of  the 
consequences  of  input  instructions  at  all  levels  of  the  synthesis  process. 
Further,  to  facilitate  the  hypothesize-and-test  mode  of  experimentation,  the 
model  is  controllable  by  interactive  graphical  editing  at  either  the  level  of 
the individual articulators or of vocal-tract shape; also, when comparison
with  spoken  utterances  is  desired,  changes  can  be  made  directly  in  the 
spectral  representation.  Finally,  the  model  must  provide  an  acoustic  output 
promptly  on  demand.  Only  on  this  basis  can  the  user  readily  assess  the 
perceptual consequences of the synthesis process, and only when the synthesizer
responds promptly to changes in the control parameters is it easy to
maintain a conceptual link between the hypothesis being tested and the result
of  the  test. 

The  control  and  display  facilities  we  will  use  are  best  considered  in 
terms  of  the  functional  modes  in  which  the  model  is  to  be  operated.  At  the 
articulatory level, there is need for convenient control in terms of articulatory
parameter values (for the individual articulators) and their allowed
variation. As an aid in visualizing these numerical specifications, the
corresponding vocal-tract outline (midsagittal) will be displayed. To change
or  improve  the  synthetic  sound,  interactive  graphical  editing  will  allow  the 
user  to  redraw  part  or  all  of  a midsagittal  vocal-tract  outline.  X-ray 
information  can  be  conveniently  introduced  at  this  point.  Generally,  also, 
the  articulatory  parameters  may  be  quickly  determined  from  such  an  outline. 

When  the  model  is  used  in  its  dynamic  mode,  the  specified  articulations 
for  allophones  will  be  supplemented  by  a set  of  rules  that  govern  the  time- 
variation  of  the  articulatory  system.  With  the  specified  articulations  stored 
in a table of parameter values, one for each specified allophone, the rules
will  operate  on  the  selected  sets  of  these  parameters  to  yield  continuous 
functions  of  time. 
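
The idea of a target table plus rules that yield continuous functions of time can be sketched as follows. The rule used here, a raised-cosine glide between successive targets, is only a stand-in chosen for illustration; the text does not specify the actual rules. The parameter names, target values, and target times are likewise hypothetical.

```python
import math

# Hypothetical target table: one row of parameter values per allophone.
TARGETS = {
    "a": {"jaw": 1.0, "lips": 0.8},
    "b": {"jaw": 0.3, "lips": 0.0},
    "i": {"jaw": 0.4, "lips": 0.6},
}


def track(sequence, name, t):
    """Value of one parameter at time t (seconds).

    sequence is a list of (allophone, time at which its target is reached).
    Between targets the value glides along a raised-cosine curve, which is
    an assumed stand-in for whatever rules the synthesizer applies.
    """
    if t <= sequence[0][1]:
        return TARGETS[sequence[0][0]][name]
    for (p0, t0), (p1, t1) in zip(sequence, sequence[1:]):
        if t <= t1:
            w = 0.5 - 0.5 * math.cos(math.pi * (t - t0) / (t1 - t0))
            v0, v1 = TARGETS[p0][name], TARGETS[p1][name]
            return v0 + w * (v1 - v0)
    return TARGETS[sequence[-1][0]][name]


utterance = [("a", 0.0), ("b", 0.1), ("i", 0.2)]
lip_track = [track(utterance, "lips", n * 0.01) for n in range(21)]
```

Sampling `track` at a fixed frame interval, as in the last line, gives the kind of parameter "track" that could then be displayed or redrawn by hand.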

An  alternative  procedure  to  the  automatic  generation  of  articulatory 
sequences  by  rule — one  that  will  be  especially  useful  in  trying  out  new  ideas 
or  making  detailed  improvements  to  rule-generated  sequences — is  to  redraw 
individual  parameter  "tracks"  on  the  interactive  graphic  display.  The  total 
effect (on the other parameters as well) can then be seen in the midsagittal
display  and  heard  as  the  synthetic  speech  output. 


At the spectrum level, it will be useful to view the spectral consequences
of the articulatory movements. For voiced articulations, it is
advantageous  to  view  the  spectral  envelope  without  regard  to  fundamental- 
frequency  variations.  Such  a spectral  envelope  and  the  corresponding  formant 
frequencies  can  be  derived  from  the  model  without  the  need  for  generating  the 
actual  signal  waveform.  Since  the  formant  frequencies  are  the  terms  in  which 
the  acoustic  cues  are  best  known,  this  also  makes  for  easy  comparisons.  When 
the  results,  as  viewed  in  the  above  representations,  are  acceptable,  we  will 
generally  want  to  generate  the  acoustic  signal  itself.  This  is  so  because,  by 
listening  to  the  signal,  we  may  quickly  judge  its  quality  or  naturalness  and 
assess  its  identifiability. 


At  this  point  we  can  make  good  use  of  another  research  tool  we  are  just 
completing:  the  Digital  Pattern  Playback  (Nye,  Reiss,  Cooper,  McGuire, 
Mermelstein  and  Montlick,  1975).  This  device  stores  the  speech  spectrum  in 
computer  core  memory  and  so  can  immediately  display  a conventional  gray-scale 
spectrogram  for  interactive  graphical  editing.  Visual  comparisons  can  then  be 
made between the original spectrogram (generated, in this case, by articulatory
synthesis) and either (1) the same spectrogram after it has been edited to
improve  intelligibility,  or  (2)  a spectrogram  of  human  speech  of  the  same 
sentence. The Digital Pattern Playback also provides for comparison by ear of
the  sounds  that  correspond  to  the  spectrograms.  In  other  ways  too,  the  DPP's 
capabilities  for  display  and  manipulation  of  speech  data  make  it  a useful 
companion  device  to  the  articulatory  synthesizer. 

Ideal Synthesizers and Research Synthesizers: Why They May Differ


Speech synthesis based on articulatory models has, of course, a considerable
history. Some of the major contributions to the design and use of
articulatory  synthesizers,  and  to  the  underlying  knowledge  about  relations 
between  articulation  and  sound,  are  listed  under  References.  There  have  been 
varied  reasons  for  building  such  synthesizers;  in  some  cases,  the  reason  was 
to demonstrate that synthesis could be done in a particular way, in some, to
mimic  human  production  as  seen  by  X-ray,  and  in  some  to  attempt  the  production 
of  more  natural  speech  than  is  easily  obtainable  from  terminal  analog 
synthesizers  or  to  control  synthesis  at  a lower  bit  rate.  Usually,  some  part 
of  the  effort  has  been  directed  to  getting  natural-sounding  speech,  that  is, 
to  approximating  the  performance  of  an  ideal  synthesizer. 

It  seems  obvious  that  an  articulatory  synthesizer  deserving  the  label 
"ideal"  would  have  a capability  for  mimicking  human  speakers  quite  exactly. 
We  would  rate  its  performance  initially  on  the  naturalness  of  its  "spoken" 
output; later, we would inquire about how accurately its chain of transformations
(from phonetic sequence to sound) matches that of the human speaker.
Comparisons  would  be  made  at  the  levels  of  vocal-tract  shape  and  acoustic 
spectrum — perhaps,  even  at  the  level  of  the  speech  waveform.  It  hardly  needs 
saying  that  no  existing  synthesizer  comes  near  to  meeting  such  criteria. 

However,  ideal  performance  is  not  necessarily  what  we  most  want  from  a 
research  synthesizer;  that  is,  the  question  of  ideal  performance  needs 
reexamination  when  we  ask,  not  about  the  naturalness  of  the  speech,  but  about 
the  usefulness  of  the  synthesizer  for  research — in  particular  for  searching 
out  the  articulatory  cues.  The  objective  of  this  latter  task  is  to  find  the 
simplest  possible  description  of  articulatory  events  that  will,  despite 
crudities  of  the  speech,  let  a listener  recover  the  phonetic  message. 

If  we  draw  on  our  experience  in  searching  for  the  acoustic  cues,  we  will 
wish  to  manipulate  these  articulatory  "events"  in  a variety  of  ways  to  study 
the  perceptual  consequences  for  the  corresponding  speech  events.  Sometimes 
this  will  involve  efforts  at  simplification,  for  example,  by  allowing  only  the 
tongue  or  the  lips  to  move  in  synthesizing  a syllable  that  is  normally  spoken 
with some degree of movement by most of the articulators. Again, experimentation
will involve stepwise variation in the relative timing of two component
gestures,  for  example,  of  tongue  and  lip  movements  in  synthesizing  a syllable 
like [ibu] or an initial consonantal cluster such as [bl] in [bled]. Here
good discrimination of the time delay between lip and tongue release would
imply a basically unified organization for the cluster, whereas poor
discriminability would indicate an independence of the constituent gestures. Too much
delay of the tongue-lip release would result in the insertion of a vowel in
the perceived phonetic string (that is, [bəled] instead of [bled]); of course,
this  must  be  avoided,  since  it  would  cue  a phonetic  distinction. 
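
A stepwise timing experiment of this kind amounts to generating a continuum of stimuli that differ only in the delay between two gestures. The sketch below shows one way such a continuum might be laid out. The linear release shape, the 50 ms release duration, the 20 ms step size, and the function names are all assumptions made for illustration, and the "opening" tracks are abstract stand-ins for the synthesizer's actual lip and tongue parameters.

```python
def release(t_ms, onset_ms, duration_ms=50):
    """Degree of opening, 0.0 (closed) to 1.0 (open), for a linear release."""
    if t_ms <= onset_ms:
        return 0.0
    if t_ms >= onset_ms + duration_ms:
        return 1.0
    return (t_ms - onset_ms) / duration_ms


def timing_continuum(delays_ms, lip_onset_ms=100, total_ms=300, frame_ms=5):
    """Build one stimulus per delay of the tongue release after the lip release."""
    stimuli = []
    for delay in delays_ms:
        frames = range(0, total_ms + 1, frame_ms)
        lips = [release(t, lip_onset_ms) for t in frames]
        tongue = [release(t, lip_onset_ms + delay) for t in frames]
        stimuli.append({"delay_ms": delay, "lips": lips, "tongue": tongue})
    return stimuli


# Five stimuli with the tongue release delayed by 0, 20, 40, 60, and 80 ms.
continuum = timing_continuum(range(0, 81, 20))
```

Each entry would be passed to the synthesizer in turn, and the resulting sounds presented to listeners in identification or discrimination tests.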

Obviously,  experimental  manipulations  of  this  kind  do  not  mimic  natural 
speech.  Often  they  call  for  an  independence  of  control  or  a grading  of 
spatial  and  temporal  relationships  that  a human  speaker  could  hardly  achieve. 
To  be  sure,  they  ought  not  violate  physiological  constraints  but,  short  of 
that,  we  will  want  to  put  the  articulators  through  their  paces  in  order  to 
assess  the  perceptual  consequences.  Our  expectation  about  the  resulting 
sounds  is  that  many,  perhaps  most,  will  sound  "strange"  but  that  some,  perhaps 
many,  will  be  clearly  identifiable. 

Thus, simplicity, both conceptual and operational, is a primary requirement
in a research synthesizer. We expect to employ the fewest independent
articulators,  and  the  fewest  control  parameters  to  position  and  move  them, 
that  will  still  generate  acceptable  tokens  for  all  the  syllables  of  the 
language, that is, that will generate all the phones in the full range of
phonologically  allowed  contexts. 

Indeed, it is the essence of modeling to try for the maximum simplicity
that will still give the required performance. In a search for the cues,
performance is properly judged at the level of intelligibility, which is
different  from,  and  less  demanding  than,  naturalness.  Hence,  naturalness  in 
synthesis  is  for  us  not  a primary  short-term  goal,  nor  was  it  needed  in  our 
earlier  search  for  the  acoustic  cues.  We  found,  in  our  work  with  the  Pattern 
Playback,  that  the  pursuit  of  the  acoustic  cues  could  proceed  in  the  presence 
of  a somewhat  unnatural  speech  quality  that  even  lacked  pitch  inflection. 
Intelligibility was the important requirement and proved to be nearly orthogonal
to the dimension of naturalness.

Departures  from  naturalness  are  not,  of  course,  a virtue,  nor  do  we  wish, 
when  manipulating  the  articulators,  to  depart  unnecessarily  from  the  general 
configurations  we  see  in  X-ray  movies.  The  guidance  that  level-by-level 
comparisons  (of  synthesis  vs.  nature)  can  give  us  is  too  valuable  to  be 
ignored.  Indeed,  we  will  sometimes  want  to  manage  the  articulation  so  as  to 
make  it  come  quite  close  to  the  human  model.  The  problem  in  designing  a 
research  synthesizer  was  to  retain  this  capability,  or  as  much  as  could  be 
had,  without  paying  too  high  a price  in  complexity  of  representation  and 
control.


SUMMARY 

Our  reasons  for  undertaking  a search  for  the  articulatory  cues  are, 
first,  that  this  will  provide  an  insight  into  the  nature  of  speech  production 
comparable  to  the  view  we  gained  of  speech  perception  when  we  succeeded  in 
finding  many  of  the  major  acoustic  cues;  and  second,  that  the  relationships 
between  cues  and  phonetic  elements  should  be  simpler  and  more  direct  in  the 
articulatory  domain  than  they  proved  to  be  in  the  acoustic  domain. 

The  origins  of  our  interest  in  this  undertaking  lie  in  what  we  think  we 
have learned about the nature of speech from two parallel lines of investigation.
From studies of how speech is perceived, we learned that, although the
acoustic  signal  contains  a wealth  of  detail,  only  some  of  the  things  one  sees 
in  a spectrogram  are  important  to  the  ear  in  identifying  the  phonetic  content 
of  the  spoken  message.  These  we  have  called  the  acoustic  cues.  Numerous 
other  things  that  can  be  seen  in  the  spectrogram  are  largely  irrelevant,  at 
least  for  intelligibility.  By  ignoring  these  things  and  synthesizing  speech 
from  patterns  that  contained  only  the  acoustic  cues,  we  were  able  to  greatly 
simplify  the  acoustic  signal  and  still  retain  most  of  the  intelligibility. 
When we examined the nature of these acoustic cues, however, we found few
one-to-one correspondences between them and the phones they represented; rather,
the  relationships  were  complex  in  ways  that  pointed  to  a reorganization  and 
overlapping  of  articulatory  gestures  during  speech  production. 

From  physiological  studies  of  how  speech  is  produced,  we  have  learned 
that  articulatory  events  also  seem  complicated;  thus,  articulation,  as  seen  in 
X-ray  motion  pictures  or  electromyographic  recordings,  involves  most  of  the 
articulators  most  of  the  time.  We  can  assume  that  here,  too,  some  limited  set 
of the component gestures provides the critical information (by way of sound
as intermediary) on the basis of which a listener identifies the phonetic
content  of  the  message.  These  we  refer  to  as  the  articulatory  cues.  If  our 
assumption  is  correct,  then  much  of  the  total  articulatory  description  is  also 
largely  irrelevant,  at  least  for  intelligibility.  But  interest  in  the 
articulatory cues goes beyond stripping away irrelevancies. A more important
point  follows  from  our  interpretation  of  the  nature  of  their  counterparts  in 
the acoustic domain: if the acoustic cues do indeed reflect their articulatory
origins, then the articulatory cues should show the simpler relationship
with  phonetic  elements. 

We  think  the  time  is  right  to  undertake  a search  for  the  articulatory 
cues.  There  exists  an  extensive  body  of  knowledge  about  both  perception  and 
production.  There  is  a proven  research  method  and  experience  with  a computer- 
based  articulatory  synthesizer  on  which  to  implement  it.  Thus,  both  a 
significant  problem  and  the  means  to  probe  it  are  at  hand. 
The  Study  of  Articulatory  Organization:  Some  Negative  Progress*

Katherine  S.  Harris


ABSTRACT 

This  paper  examines  some  of  the  evidence  against  some  commonly- 
held  views  of  the  nature  of  articulatory  units.  Four  hypotheses  are 
examined:  First,  speech  perception,  and  hence  speech  production, 
operates  on  some  invariant  "core  unit"  in  a time-varying  signal. 
Second,  speech  perception  extracts  invariant  units  from  a time- 
varying  signal  on  a phoneme  target  basis.  Third,  unit  targets  are 
supported  by  positioned  feedback.  Fourth,  the  difficulties  with  the 
first  three  formulations  can  be  solved  by  changing  from  "phoneme" 
units  to  some  higher  level  unit.  Reasons  are  found  for  discarding 
all  these  hypotheses. 

INTRODUCTION 

This  paper  is  an  attempt  to  summarize  what  we  now  know,  or  rather  don't 
know,  about  a vaguely  defined  area  called  "the  organization  of  speech."  In 
particular,  the  topic  is  what  MacNeilage  has  called  the  "reality  status  of 
concepts  of  linguistic  units"  (MacNeilage,  1973).


The  study  of  the  organization  of  speech  is  the  province  of  speech 
science,  a rather  uneasy  blend  of  elements  from  phonetics  and  motor  physiolo- 
gy. Perhaps  I can  illustrate  the  mixture  with  an  anecdote  quoted  from 
Granit's  (1967)  biography  of  the  great  neurophysiologist,  Sherrington.  He  is 
recorded  as  saying  to  his  student,  Wilder  Penfield,  "It  must  be  nice  to  hear 
the  preparation  speak  to  you."  Our  hypotheses  about  speech  organization  came 
partly,  then,  from  phonetics,  and  partly  from  general  neurophysiology.  I 
would  like  to  run  through  four  of  these  hypotheses,  in  their  argument  form, 
and  discuss  some  evidence  that  has  caused  them  to  fail.  The  first  hypothesis 
follows:
Hill-Top  Hotel,  Tokyo,  7-10  December,  1976.

HYPOTHESIS  1


Speech  is  perceived  as  having  invariant  units:  therefore,  perception
must  operate  on  invariant  parts  of  the  acoustic  signal.

This  hypothesis,  or  versions  of  it,  guided  early  work  at  the  Bell 
Telephone  Laboratories.  It  has  two  obvious  problems.  The  first  is  that  the
acoustic  signal  for  a given  speech  sound  and  for  a given  speaker,  depends  on 
the  size  of  the  vocal  tract.  This  problem  led  Peterson  (1952,  1961),  and
later  Gerstman  (1968)  to  suggest  that  the  listener  arrives  at  vowel  judgments 
by  some  kind  of  perceptual  normalization  of  the  presented  vowel,  based  on  the 
relationship  between  formants.  There  are  some  problems  with  this  theory  as  a 
dynamic  hypothesis,  as  we  will  discuss  below,  but  refinements  of  the  theory 
will  account  for  differences  between  steady-state  formant  values  of  the  vowels 
for  different  speakers. 

A second  problem,  and  at  that  time  an  apparently  more  serious  one,  was 
that  when  convenient  visual  displays  for  the  acoustic  speech  signal  became 
widely  known,  no  units  corresponding  to  phonetic  entities  were  obvious 
(Potter,  Kopp  and  Green,  1947).  A suggestion  made  by  them  was  that  perception 
is  organized  to  focus  on  the  relatively  steady-state  aspects  of  the  signal, 
skipping  over  the  variable  "transitional"  stretches  between  those  steady 
states.  Indeed,  Cyril  Harris  (1953),  then  working  at  Bell,  attempted  to 
synthesize  speech  by  putting  together  short  segments  of  speech  clipped  from 
the  ongoing  stream.  The  result  was  unintelligible. 

I don't  believe  we  have  yet  learned  quite  enough  from  that  failure.  Even 
at  the  time,  a different  interpretation  of  the  transitional  portions  was 
available,  namely,  that  these  transitions  were  essential  for  speech  intelligi- 
bility, since  they  could  be  shown  to  have  cue  value,  particularly  for  the 
consonants.  This  interpretation  had,  indeed,  been  demonstrated  directly  in 
work  with  speech  synthesis  (Cooper,  Delattre,  Liberman,  Borst  and  Gerstman, 
1952).  There  were  later  attempts  to  "explain"  the  speech  perceptual  mechanism 
by  more  complicated  hypotheses,  as  we  will  see  below. 

HYPOTHESIS  2 

Speech  is  perceived  in  terms  of  invariant  units:  it  can  be  shown
that  there  are  few  steady-state  segments  in  speech.  Hence,  speech 
perception  must  process  the  signal  to  extract  invariant  perceptual 
units  from  a time  varying  signal. 

This  is  one  version  of  the  Haskins  "motor  theory,"  stripped  of  its 
physiological  detail  (Liberman,  Cooper,  Harris  and  MacNeilage,  1963).
Basically,  the  idea  is  that,  in  production,  there  is  temporal  and  spatial 
smear  of  various  low-level  aspects  of  the  articulatory  process,  so  that  the 
resulting  acoustic  output  is  an  encipherment  of  the  input  signal;  further,  in 
perception,  the  perceptual  apparatus  somehow  decodes  the  signal,  by  reference 
to  articulation,  into  its  underlying  units.  There  are  a number  of  subhy- 
potheses, of  varying  degrees  of  sophistication,  about  what  these  underlying 
units  might  be  (Harris,  1976). 
Invariant  Electromyographic  Signals

There  were  some  early  Haskins  attempts  to  show  that  the  signals  to  the 
muscles  were  less  variable  than  the  resulting  acoustic  outputs  (Harris, 
Lysaught  and  Schvey,  1965;  MacNeilage,  1963;  Cooper,  Liberman,  Harris  and 
Grubb,  1958;  Cooper,  1965).  Apart  from  the  difficulty  of  testing  the 
proposition  that  one  type  of  unit  is  less  variable  than  another,  the 
hypothesis  suffers  from  the  fact  that,  as  stated,  it  ignores  the  variations  in 
muscle  signal  size  associated  with  the  different  distances  through  which 
articulators  must  travel,  when  different  phonetic  units  are  juxtaposed.  This 
point  was  discussed  by  MacNeilage  (1970)  who  observed  that  coarticulation 
effects  on  muscle  signals,  due  to  this  effect,  are  ubiquitous. 

Articulatory  Targets

The  point  of  view  that  articulatory  movement  "aims  at"  articulatory 
targets,  is  the  view  espoused  by  MacNeilage  in  the  paper  cited  above.  He 
suggests  that  the  targets  are  maintained  by  some  form  of  feedback  from  the 
periphery,  as  does  Abbs  (1973). 

Acoustic  Targets

A variant  of  this  view,  advanced  by  Ladefoged  (1967)  and  Lieberman 
(1973),  among  others,  is  that  speakers  aim  at  acoustic  targets,  which  can  be 
realized  by  different  articulatory  maneuvers,  depending  on  context  or  speaker. 

Closely  related  views  have  been  developed  for  somewhat  different  ends  by 
Lindblom  (1963)  and  by  Öhman  (1967).  Lindblom,  in  explaining  vowel  neutralization
in  rapid  or  destressed  speech,  suggested  that  invariant  signals  are
sent  to  the  articulators  for  a given  phoneme  target,  but  that  the  target  is 
not  always  attained  because  the  next  signal  may  be  sent  too  soon,  causing 
target  undershoot.  Öhman  (1967),  in  attempting  to  account  for  phonetic
context  effects,  suggests  that  they  arise  from  the  temporal  overlap  of 
movements  towards  target  positions.  Lindblom  has  developed  a very  similar 
inertial  view  of  speech  timing  effects  (Lindblom,  1967)  to  account  for 
differences  in  the  inherent  duration  of  vowels. 
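
The  undershoot  idea  described  above  can  be  illustrated  with  a small  numerical  sketch.  It  assumes,  purely  for  illustration,  that  an  articulator  approaches  its  target  exponentially  with  a fixed  time  constant.  The  function  names,  the  60  ms  time  constant,  and  the  unit  distance  are  hypothetical  and  are  not  taken  from  Lindblom's  papers.

```python
import math


def position_at_end(start, target, duration_ms, time_constant_ms=60.0):
    """Position reached after duration_ms, assuming an exponential
    approach from start toward target (an illustrative assumption)."""
    return target + (start - target) * math.exp(-duration_ms / time_constant_ms)


def undershoot(start, target, duration_ms, time_constant_ms=60.0):
    """Distance still separating the articulator from its target when
    the next command arrives after duration_ms."""
    return abs(target - position_at_end(start, target, duration_ms, time_constant_ms))


if __name__ == "__main__":
    # The same invariant command (target = 1.0) is issued every time;
    # only the time available before the next command differs.
    for duration_ms in (50.0, 100.0, 200.0):
        print(duration_ms, round(undershoot(0.0, 1.0, duration_ms), 3))
```

Under  these  assumptions  the  command  is  identical  in  every  case,  and  the  shorter  the  interval  before  the  next  command,  the  farther  the  articulator  remains  from  its  target.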

All  these  models  have  a common  view  of  the  speech  process:  peripheral
encoding  is  believed  to  account  for  coarticulation  of  signals  which  are
invariant  at  a central  stage  in  the  articulatory  process.  To  the  extent  that 
these  models  specify  a perceptual  process,  they  assume,  either  explicitly  or 
implicitly,  that  perception  proceeds  by  reversing  the  encoding  operations  of 
production  (Lindblom  and  Studdert-Kennedy,  1967).

Recent  evidence  suggests  that  this  is  a misleading  picture.  Strange, 
Verbrugge,  Shankweiler  and  Edman  (1976)  presented  listeners  with  sets  of 
natural  vowels,  either  alone  or  in  consonantal  context.  They  found  that 
identification  was  better  when  the  vowels  were  in  context.  If  perception  were 
indeed  a steady-state  target-extracting  process  of  any  kind,  it  would  be  hard 
to  explain  these  results.  When  a listener  is  presented  with  vowels  in  steady- 
state  form,  they  are  presumably  already  "at  target."  When  they  are  presented 
in  CVC  context,  the  vowel  target  must  usually  be  inferred,  due  to  undershoot 
or  similar  adjustments.  Since  no  decoding  is  required  in  the  former  case, 
listeners  should  be  maximally  accurate.  The  fact  that  they  perform  worse
calls  into  question  any  of  the  "target-extraction"  formulations. 

HYPOTHESIS  3 

Speech  has  invariant  units.  These  units  are  maintained  in  adult
speakers  by  some  form  of  nonacoustic  feedback  from  the  periphery.

Hypotheses  about  the  role  of  feedback  in  speech  have  been  with  us  for 
some  time,  although  the  literature  has  not  always  been  explicit  about  what 
kind,  or  kinds,  of  feedback  is  crucial,  as  between  gamma-loop  feedback  (Abbs,
1973),  or  tactile  and  kinesthetic  feedback  (Ringel  and  Steer,  1963).  However,
in  spite  of  the  general  importance  of  the  topic,  there  seem  to  be  substantial 
roadblocks  in  the  path  of  finding  out  more  by  the  means  presently  available. 
Three  approaches  have  been  used. 

First,  there  have  been  a number  of  studies  involving  reduction  of  oral 
tactile  sensation,  most  notably  those  of  Ringel  and  his  associates  (for 
example,  Ringel  and  Steer,  1963).  In  general,  these  experiments  show  that  the 
effects  of  block  on  various  branches  of  the  trigeminal  nerve  are  not 
overwhelming  (Scott  and  Ringel,  1971;  Borden,  Harris  and  Catena,  1973). 
Furthermore,  the  experimental  procedure  causes  motor,  as  well  as  sensory 
effects,  so  that  the  results  are  difficult  to  interpret  (Borden,  Harris  and 
Catena,  1973;  Abbs,  Folkins  and  Sivarajan,  1976). 

A second  approach  has  been  to  scan  the  relevant  neurophysiological 
literature,  in  order  to  find  an  appropriate  animal  model  for  the  human  speech 
situation.  While  it  is  difficult  for  an  amateur  to  assess  the  work,  there 
does  not  seem  to  be  an  entirely  appropriate  analog  for  speech,  and  results  are 
conflicting  with  regard  to  the  importance  of  various  types  of  feedback  for 
various  kinds  of  movement  in  the  examples  discussed.  For  example,  animals  can 
use  deafferented  limbs  in  learned  or  unlearned  tasks  (Taub  and  Berman,  1968), 
although  some  deterioration  of  fine  motor  control  is  generally  found.  On  the 
other  hand,  a lesion  of  the  tract  of  the  mesencephalic  nucleus,  which  abolishes
spindle  afferent  input  from  the  masticatory  muscles  (Goodwin  and  Luschei,
1974),  does  not  alter  chewing  behavior  in  any  obvious  way.

The  third  approach  has  been  to  study  the  effects  of  disruptions  of 
articulation.  Here  again,  there  is  no  solid  body  of  relevant  experimentation 
and  results  are  often  conflicting.  Folkins  and  Abbs  (1975),  for  example,  have 
shown  that  speakers  can  compensate  immediately  for  the  effects  of  unexpected 
interruptions  of  articulator  movement.  In  their  experiment,  the  jaw  was 
unexpectedly  loaded  during  the  closure  for  a bilabial  stop  consonant.  Results 
show  that  the  lips  compensate  for  the  jaw  in  completing  closure  on  the  first 
trial.  Another  often  cited  study  by  Lindblom  and  Sundberg  (19^3)  reports  that 
a speaker  can  duplicate  his  natural  vowels  with  a bite  block  between  his 
teeth,  with  virtually  no  time  for  relearning.  However,  the  only  citation  of 
the  study  I know  is  an  oral  report,  with  no  experimental  details.  Hamlet  and 
Stone  (1976),  using  a different  experimental  paradigm,  find  compensatory 
effects  over  fairly  substantial  periods. 

If  articulation  suffers  some  interference,  the  speaker  may  use  either 
acoustic  or  nonacoustic  feedback  to  compensate  for  the  disruption.  It  has 
been  suggested  by  Nooteboom  (1970)  that  compensatory  articulation  may  well  be
guided  by  acoustic  rather  than  articulatory  equivalence.  If  so,  we  would 
expect  devastating  effects  of  articulatory  disruption  accompanied  by  acoustic 
masking.  So  far  as  I know,  this  line  of  research  is  unexplored.  Surely,  the 
disruption  experimental  paradigm  is  eligible  for  far  more  searching  explora- 
tion than  it  has  thus  far  received. 

HYPOTHESIS  4 

Speech  has  units,  but  we  would  understand  its  organization  better  if 
we  turned  from  phones  to  more  appropriate  units,  such  as: 

Syllables 

The  evidence  against  Kozhevnikov  and  Chistovich's  syllable-based  model  of
coarticulation  (Kozhevnikov  and  Chistovich,  1965)  is,  in  large  part,  a product 
of  the  industry  of  Kenneth  Moll  and  his  students  (for  example,  Daniloff  and 
Moll,  1968;  McClean,  1973),  although  there  has  been  recent  interesting  and 
important  work  by  Benguerel  (Benguerel  and  Cowan,  1974).  These  studies  all 
show  that  there  is  little  evidence  that  the  syllable  boundary,  as  traditional- 
ly defined,  blocks  coarticulation.  Benguerel  interprets  his  results  as 
supporting  a feature-based  model  of  coarticulation,  such  as  that  of  Henke  (see 
below).

Features 


This  is  not  the  place  for  an  exposition  of  the  virtues  of  feature-based 
models  in  general.  However,  some  recent  experiments  argue  against  feature- 
based  models  of  anticipatory  coarticulation.  Tom  Gay,  in  his  paper  at  this 
conference,  will  be  discussing  evidence  that  phonetic  entities  are  separately 
organized  at  the  electromyographic  level,  even  when  there  is  no  reason  for  it 
in  feature  terms.  Perhaps  even  more  important  is  evidence  that  specifically 
contradicts  Henke's  "scan-ahead"  model  for  articulatory  coarticulation  (1967). 
The  model  proposes  that  a given  feature  will  appear  in  the  speech  stream  as 
soon  as  it  can,  by  assimilative  spreading.  Thus,  if  a nasal  is  preceded  by  a 
series  of  vowels  which  are  unspecified  for  nasalization,  they  should  all  be 
equally  nasalized.  This  does  not  happen  (Kent,  Carney  and  Sevareid,  1974; 
Ushijima  and  Hirose,  1974).  The  degree  of  nasality  of  a vowel,  as  measured  by 
velar  height  during  its  production,  depends  on  its  proximity  to  a nasal 
consonant.
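
The  prediction  attributed  to  the  scan-ahead  model  above  can  be  made  concrete  with  a short  sketch.  This  is  a simplified  reading  of  the  model  as  described  in  the  text,  not  Henke's  own  program:  segments  are  represented  as  dictionaries,  a missing  feature  means  "unspecified,"  and  the  function  name  and  the  example  string  are  hypothetical.

```python
def scan_ahead(segments, feature):
    """Return the value of `feature` for each segment after look-ahead
    spreading: every segment unspecified for the feature takes the value
    of the next segment to its right that is specified for it."""
    values = [seg.get(feature) for seg in segments]
    upcoming = None  # nearest specified value to the right, if any
    for i in range(len(values) - 1, -1, -1):
        if values[i] is None:
            values[i] = upcoming
        else:
            upcoming = values[i]
    return values


if __name__ == "__main__":
    # Hypothetical string: oral consonant, two vowels unspecified for
    # nasality, then a nasal consonant.
    string = [{"nasal": False}, {}, {}, {"nasal": True}]
    print(scan_ahead(string, "nasal"))
```

In  this  sketch  both  vowels  receive  the  same  nasal  value  regardless  of  their  distance  from  the  nasal  consonant.  The  velar-height  data  cited  above  show  instead  a graded  effect  that  depends  on  proximity,  which  is  the  discrepancy  at  issue.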

Some  recent  results  of  our  own  can  be  interpreted  in  the  same  way
(Bell-Berti  and  Harris,  1976).  We  examined  anticipatory  coarticulation  of  lip
rounding  for  /pasup/,  /patup/,  /patsup/  and  /pastup/,  measured  by  the
electromyographic  activity  of  the  orbicularis  oris  muscle.  We  find  that  the  onset  of
electromyographic  activity  seems  to  precede  the  onset  of  acoustic  activity  for 
the  vowel  by  a fixed  temporal  interval,  rather  than  to  be  locked  to  the 
preceding  phone  or  cluster  of  phones. 

Another  bit  of  evidence  arguing  against  a feature-based  model  of  coarti- 
culation comes  from  an  experiment  on  Swedish  rounded  vowels  (McAllister, 
Lubker  and  Carlson,  1974),  again  using  the  onset  of  orbicularis  activity  as  a 
measure  of  anticipatory  coarticulation  of  lip-rounding.  They  compared  onsets 
of  a series  of  front  and  back  rounded  vowels  (both  of  which  occur  in  Swedish)
in  the  frame  /itV/.  If  Henke's  model  were  correct,  lip  rounding  should  begin
at  the  same  time,  relative  to  the  offset  of  /i/,  for  all  vowels,  since  it  is
the  feature  composition  of  the  preceding  phones  that  determines  the  onset  of 
anticipatory  coarticulation.  Interestingly  enough,  the  onset  of  labial  activ- 
ity is  later  for  the  back  vowels  than  for  the  front  vowels,  so  that  the  lips 
seem  to  "wait  for  the  tongue,"  which  must  move  further  for  back  vowels  than 
for  front  vowels.  In  short,  the  temporal  extent  of  anticipatory  coarticula- 
tion cannot  be  predicted  from  a knowledge  of  the  feature  composition  of  the 
phones  before  the  target. 

One  explanation  of  these  data  is  that  articulatory  gestures  are  pro- 
grammed temporally,  and  not  in  syllable  or  feature  units;  however,  we  have  yet 
to  determine  the  influence  of  stress  and  speaking  rate  on  this  timing.  In 
addition,  we  must  also  examine  the  timing  relationships  between  movements  of 
different  articulators,  since  we  may  find  that  subparts  of  segment  gestures 
preserve  their  timing  relationships. 

Overall,  given  this  rather  negative  review  of  our  progress,  what  can  we 
propose  in  a more  positive  direction?  I can  only  offer  a suggestion  by  my 
colleague,  Michael  Turvey  (Turvey,  Shaw  and  Mace,  1976),  who  points  out  in 
reviewing  recent  Russian  studies  of  locomotion,  that  all  skilled  movements 
have  subparts  which  tend  to  preserve  their  relationships  to  each  other  when 
the  movement  is  transformed  as  by  more  rapid  execution.  He  gives  as  an 
example  the  observation,  by  Kent,  Carney  and  Sevareid  (1974),  that  velar 
lowering  and  raising  in  the  word  contract  is  tied  to  particular  events  in  the 
sequence  of  tongue  movements.  Whether  this  particular  example  suggests  a 
useful  experimental  paradigm  or  not,  it  emphasizes  a kind  of  observation  we 
have  been  neglecting  in  studies  of  speech  production:  that  is,  what  relation- 
ships between  articulatory  events  are  preserved  when  context  changes,  whether 
by  increased  speaking  rate  or  stress,  or  segmental  environment?  Furthermore, 
how  do  the  perceptual  consequences  of  this  view  differ  from  those  of  a target 
extraction  approach?  Perhaps,  when  we  can  formulate  experimental  questions  in 
terms  such  as  these,  we  will  be  able  to  make  progress  in  understanding  speech 
organization.
Phonetic  Perception*

Alvin  M.  Liberman  and  Michael  Studdert-Kennedy


INTRODUCTION 

To  include  a chapter  on  phonetic  perception  in  a handbook  like  this  is  to 
assume  that  the  process  is  not  wholly  accounted  for  by  such  principles  as  we 
might  find  in  research  on  the  perception  of  nonspeech  sounds.  It  is 
appropriate,  then,  that  we  here  offer  support  for  that  assumption.  We  will 
not  examine  all  relevant  considerations,  only  those  that  bear  most  directly  on 
the  relation  between  the  information  in  the  acoustic  signal  and  the  listener's 
perceptual  response  to  it;  in  our  view,  those  are  the  most  pertinent.  Nor 
will  we  analyze  such  arguments  as  there  are  for  the  opposite  assumption — 
namely,  that  auditory  mechanisms  are  sufficient — though  we  will,  as  is  proper, 
refer  the  reader  to  relevant  papers. 1 

Phonetic  perception  is  what  happens  when,  on  hearing  speech,  a listener 
recovers  the  phonetic  message.  That  message  consists  of  the  meaningless 
segments  we  perceive  as  consonants  and  vowels.  These  are  ordered  in  strings, 
organized  into  larger  units,  and  carried  on  a prosodic  contour.  The  segments, 
both  consonants  and  vowels,  are  called  'phones';  among  the  larger  units  are 
syllables;  the  relevant  aspects  of  prosody  are  stress  and  intonation.  We  must 
distinguish  between  the  perceived  phones  and  the  more  abstract  phonologic 
forms  that  underlie  them.  Thus,  the  final  segments  in  'cats'  and  'dogs'  are
different  phones — voiceless  [s]  in  'cats'  and  voiced  [z]  in  'dogs' — yet  at  a 
more  abstract  phonologic  (or  morphophonemic)  level  they  are  the  same.  Our
concern  will  be  with  the  less  abstract  phones  and  their  relation  to  the  still 
less  abstract  sounds.  Also,  to  keep  our  task  within  bounds,  we  will  deal  only 
with  the  segmental  aspects  of  phonetic  structure,  including  the  organization 
of  phones  into  syllables,  though  perception  of  prosody  presents  interesting, 
perhaps  even  similar,  problems. 

Students  of  language  commonly  assume  a complex,  grammatical  relation 
between  meaning  and  its  phonetic  vehicle,  but  often  disregard  the  further 
complications  that  arise  in  the  conversion  to  sound.  They  tend  rather  to 
suppose  that  the  phonetic  segments  (or  their  constituent  features)  are 
represented  discretely  in  the  signal,  as  if  by  an  acoustic  alphabet.  If  that 
were  so,  perceiving  phones  would  be  like  perceiving  any  other  sounds;  there 
would  be  no  special  problem  of  phonetic  perception  and  no  reason  for  this 
chapter.  There  is  evidence,  however,  that  the  sounds  of  speech  are  not  an 
alphabet  on  the  phones,  but  a complex  and  grammatical  code.  In  the  first
section  of  the  paper  we  will  place  that  code  in  the  larger  scheme  of  language 
and  identify  its  important  characteristics. 

If  it  is  true  that  the  phones  are  linked  to  the  sounds  by  a special  code, 
we  should  suppose  that  extracting  the  phones  from  the  sounds  would  require  a 
correspondingly  special  decoder.  In  the  second  section  we  will  give  reasons 
for  supposing  that  such  a decoder  may  exist. 

There  is,  of  course,  an  alternative  to  grappling  with  the  problems 
created  by  the  peculiar  relation  between  sound  and  phonetic  message:  we  can 
try  to  evade  them.  Indeed,  we  might  suppose  that  phonetic  perception  does  not 
occur,  that  the  segments  of  the  phonetic  level  are  mere  fictions,  invented  by 
linguists  for  their  convenience,  with  no  functional  significance  in  language 
or  in  its  psychophysiology.  On  that  view  the  listener  would  go  directly  from 
sound  to  some  meaningful  segment  (for  example,  word),  bypassing  the  phonetic 
and  phonologic  structure  entirely.  To  justify  our  concern  with  phonetic 
perception,  we  will,  in  the  third  section,  argue  that  phonetic  (and  phonolo- 
gic) structure  plays  an  important  role  in  language  and  is,  in  fact,  recovered 
by  the  listener  when  he  comprehends  what  is  said  to  him. 

THE  SPECIAL  NATURE  OF  THE  SPEECH  CODE:  FUNCTION,  FORM,  AND  KEY

For  anyone  who  would  understand  the  perception  of  speech,  the  salient 
fact  is  that  the  perceived  phones  are  related  to  the  sounds  by  a peculiar 
grammatical  code,  one  of  several  that  link  sound  to  meaning.  To  grasp  the 
nature  of  that  code,  it  is  useful  to  view  it  as  part  of  the  larger  grammar. 
(See,  for  example,  Mattingly  and  Liberman,  1969;  Liberman,  1970).  For  that 
purpose,  we  will  divide  grammar — and  language — in  two.  Making  the  cut  at  the 
phonetic  level,  we  will  look  first  toward  meaning  and  then,  in  the  other 
direction,  toward  sound. 

The  Function  of  the  Meaningless  Phones  and  of  the  Grammatical  Codes  that  Link 
Them  to  the  Meaningful  Message 

To  see  what  grammar  accomplishes,  and  thus  to  appreciate  the  role  of  the 
meaningless  phones,  we  should  first  consider  the  shortcomings  of  an  agrammatic 
mode  of  communication.  (See  Liberman,  Mattingly,  Turvey,  1972;  Liberman,  in
press).  In  that  mode,  there  would  be  a straightforward  connection  between 
message  and  signal.  Instead  of  grammatical  rules  like  those  that  build  longer 
and  more  complex  structures  (syllables  and  sentences,  for  example)  out  of 
shorter  and  simpler  ones  (phones  and  words),  there  would  be  only  a list  of  all 
possible  messages  and  their  corresponding  signals.  Obviously,  such  a mode 
would  work  well  enough  if  there  were  reasonable  agreement  in  number  between 
messages  and  signals.  But,  just  as  obviously,  there  is  no  such  agreement: 
the  number  of  messages  we  have  to  send  is  vastly  greater  than  the  number  of 
holistically  different  signals  we  can  efficiently  produce  and  perceive, 
especially  if  we  are  committed  to  signaling  with  sound.  In  short,  an 
agrammatic  mode  of  communication  would  limit  the  number  of  possible  messages 
to  the  small  number  of  distinctively  different  sounds  we  can  produce  and 
perceive.  The  consequence  would  be  that  most  of  what  we  want  to  express  with 
language  would  be  inexpressible. 

We  should  suppose,  then,  that  one  function  of  grammatical  codes  is  to
restructure  the  information  in  the  messages  so  as  to  make  it  compatible  with
our  sound-signaling  ability,  and  thus  to  match  the  potentialities  of  the
message-generating  intellect  to  the  limitations  of  the  vocal  tract  and  the
ear.  But  why  two  grammars,  syntax  and  phonology,  and  why  the  two  kinds  of 
segments,  meaningful  and  meaningless,  they  govern?  What  is  the  function  of 
this  dual  structure,  characteristic  of  all  languages,  and  especially  of  the 
meaningless,  phonologic  portion  that  concerns  us  in  this  chapter?  Why  not,  in 
a simpler  world,  have  only  a syntax — rules  that  organize  and  reorganize 
segments  (words,  for  example)  that  are  meaningful?  Such  a language  could, 
from  a logical  point  of  view,  evade  the  limitations  imposed  by  the  paucity  of 
different  segments,  since  it  would  be  possible,  even  with  a small  set,  to
construct  an  infinitude  of  messages.  A phonology-free  language  would,  of
course,  have  to  make  do  with  a small  vocabulary,  but  that  is  not,  in  logic,  a
devastating  limitation:  For  to  the  extent  that  we  can  organize  our  semantic
space  by  a hierarchy  of  features,  a small  vocabulary  might  nevertheless
suffice  for  many  of  the  things  we  want  to  talk  about  (Ogden,  1967).  But
specifying  a particular  thing  would,  at  best,  take  a lot  of  talking  and
listening,  given  the  properties  of  vocal  tracts  and  ears,  and  it  would
require,  in  addition,  that  one's  mind  work  in  ways  that  may  be  uncongenial  to
it.

At  all  events,  no  language  does  get  along  with  a very  small  vocabulary. 
Vocabularies  tend  to  be  large  and  to  grow  ever  larger.  (But  see  Klima,  1975, 
pp.  247-270).  To  achieve  these  large  vocabularies,  given  the  limited  number
of  signals  we  can  command,  languages  use  a very  few  meaningless  segments — two 
to  three  dozen,  in  most  cases — to  construct  a large  number  of  meaningful  ones. 
Hence,  phonology.  Taken  together,  then,  syntax  and  phonology  serve  as  a kind 
of  interface,  joining  an  intellect,  which  initiates,  comprehends,  and  stores 
messages,  to  a vocal  tract  and  ear,  which  produce  and  receive  the  sounds  by 
which  those  messages  are  conveyed  (Mattingly,  1972;  Liberman,  1974). 
The  Function  of  the  (Grammatical)  Speech  Code  that  Links  Phonetic  Message  to
Sound 

Perhaps  the  need  for  grammatical  recoding  has  ended  with  the  production 
of  the  phonetic  message.  If  so,  the  final  link  to  speech  could  be  agrammatic — 
a unit  of  sound  for  each  segment  of  the  message — and  thus  of  no  special 
interest  to  either  the  linguist  or  psychologist.  But  the  phonetic  message  is 
only  a stage  in  the  grammatical  process  that  connects  meaning  to  sound. 
Further  and  still  quite  drastic  restructuring  is  necessary.  To  see  why,  we 
need  only  pit  the  most  obvious  requirements  of  phonetic  communication  against 
the  capabilities  of  the  ear  and  the  vocal  tract.  Although  that  has  been  done 
in  earlier  publications  (Liberman,  Cooper,  Shankweiler,  and  Studdert-Kennedy,
1967;  Liberman  et  al.,  1972;  Studdert-Kennedy,  in  press),  we  ought  nevertheless
to  offer  a brief  review  here.

Two  requirements  of  phonetic  communication  are  of  special  interest:  the
phones  must  be  communicated  at  a high  rate,  and  their  order  must  be  properly
apprehended  by  the  listener.  With  regard  to  rate,  it  is  obvious  enough  that 
language  is  more  efficient  the  more  rapidly  it  is  communicated.  It  is  only 
slightly  less  obvious  that  language  is  hard  to  understand  when  it  is 
communicated  too  slowly.  Slow  communication  can  create  difficulties  because 
the  meaning  of  the  longer  segments  is  distributed  in  complicated  ways  among 
the  shorter  segments  they  comprise.  Hence  full  comprehension  of  a sentence, 
for  example,  must  wait  on  completion  of  a structure  that  is  formed  by  the 
words.  The  requirement  about  order  follows  from  the  use  of  a small  number  of 
phonetic  segments.  If  we  are  to  keep  the  number  of  segments  per  word  within 
bounds,  we  must  respect  order:  a word  like  'dam'  must  be  distinguished  from
its  mirror  image,  'mad.'

It  is  plain  that  if  the  phonetic  segments  were  transmitted 
agrammatically — that  is,  each  phone  by  a discrete  segment  of  sound — the
requirements  of  phonetic  communication  would  not  be  met.  We  could  neither 
speak  nor  listen  as  fast  as  we  need  to — and,  indeed,  do — nor  could  the 
listener  keep  the  segments  in  their  proper  order.  Speaking  rates  vary 
considerably,  but  they  reach  20  or  25  phones  per  second,  at  least  for  short
stretches.  Presumably,  it  would  be  impossible  to  speak  that  rapidly  if,  as  in
an  agrammatic  mode,  the  gestures  were  made  discretely,  one  for  every  phone  and
each  in  its  turn.  And  even  if  the  speaker  could  articulate  that  fast,  the
listener  could  not  resolve  the  sound  segments  that  would  result;  at  20  or  25
acoustic  segments  per  second,  the  units  of  sound  (hence  phones)  would  merge  to 
produce,  in  perception,  a buzz  or  pitch.  Moreover,  the  listener  would  have 
difficulty  identifying  the  order  of  such  discrete  sound  units,  even  at  rates 
low  enough  to  permit  him  to  resolve  them.  Given  the  results  of  research  on 
nonspeech  sounds  (Warren,  1969;  1976b),  we  should  suppose  that  he  could
distinguish  permutations  of  segments,  but  only  on  the  basis  of  overall
differences  in  the  perceived  pattern,  not  by  assigning  each  segment  to  its  own 
place  in  a sequence.  Surely,  then,  the  grammatical  restructuring  that  makes 
communication  distinctively  linguistic  cannot  end  with  the  production  of  the 
phonetic  message.  At  least  one  more  grammatical  conversion  is  necessary  if 
the  message  is  to  be  transmitted  and  perceived  efficiently. 
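
The arithmetic behind this argument can be made explicit. The following Python
sketch is an illustration only and not part of the original analysis; it
assumes, for simplicity, that in a hypothetical agrammatic transmission every
phone would occupy one discrete stretch of sound of equal length. The function
name is our own.

```python
def discrete_segment_ms(phones_per_second):
    """Duration (ms) of each sound segment if every phone were sent as
    one discrete, equal-length segment (an idealizing assumption)."""
    return 1000.0 / phones_per_second

# Speaking rates cited in the text: 20 to 25 phones per second.
for rate in (20, 25):
    print(rate, "phones/s ->", discrete_segment_ms(rate), "ms per segment")
```

At the cited rates each discrete segment would last only 40 to 50 msec, and
such segments would recur 20 to 25 times per second.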

The Form of the (Grammatical) Speech Code that Links Phonetic Message to
Sound; the Fit of Form to Function


In  the  conversion  of  abstract  phones  to  concrete  sounds  there  is  a 
restructuring  of  information,  designed  as  if  to  match  the  requirements  of 
phonetic  communication  to  the  properties  of  the  vocal  tract  and  the  ear. 
Though  much  that  is  important  about  this  conversion  remains  to  be  learned, 
enough  is  known  to  enable  us  to  see  some  of  its  important  characteristics. 
Thus,  we  know  that  the  segments  are  first  broken  down  into  something  like  the 
well-known  articulatory  features  of  place  of  production,  manner  of  production, 
and voicing.² (For an explication, see, for example, Ladefoged, 1971.) As
speech is produced, those separate features are assigned to the appropriate
and more-or-less independent parts of the articulatory apparatus; the component
gestures made by those parts are organized into preplanned coding units
longer than a phonetic segment; and the organized complex of gestures,
representing  features  of  each  of  several  successive  phonetic  segments,  is 
produced  simultaneously  or  with  considerable  overlap.  The  result  is  that  the 
coding  unit — roughly  a syllable  in  many  cases — comprises  segments  whose 
component  gestures  (features)  are  thoroughly  interleaved  (Cooper,  Delattre, 
Liberman, Borst, and Gerstman, 1952; Fant, 1962; Liberman et al., 1967;
Cooper, 1972; Stevens and House, 1972; Studdert-Kennedy, 1975a). We will call
that  arrangement  by  its  common  name,  coarticulation. 


Coarticulation  enables  a speaker  to  produce  phonetic  segments  at  rates 
considerably  higher  than  the  rates  at  which  he  must  change  the  states  of  his 
articulatory  muscles  (Cooper,  1972).  Thus,  he  speaks  faster  than  he  could  if 
each  phonetic  segment  were  represented  by  a unit  gesture,  produced  in  its 
proper  turn  as  one  of  a sequence  of  gestures.  But  coarticulation  has 
consequences  for  perception  as  well,  enabling  the  listener  to  evade  just  those 
limitations  of  the  auditory  system  we  referred  to  earlier.  Consider,  again, 
that if the phonetic message were transmitted agrammatically (that is, one
acoustic segment for each phonetic segment), then the temporal resolving power
of  the  ear  would  make  it  impossible  to  perceive  speech  at  the  rates  that  we 
do,  in  fact,  commonly  attain.  But,  as  we  have  seen,  the  relation  between 
phonetic message and sound is not agrammatic in that sense. Rather,
coarticulation effectively folds information about several successive phonetic
segments into a single stretch of sound. Moreover, the overlapped activity of
several different articulators (for example, lips and tongue) will often
affect the same parameter of the sound (for example, the second formant).³ At
any chosen instant of time, therefore, each acoustic parameter is (commonly)
carrying information about more than one phonetic segment. [For fuller
discussion, see Liberman et al. (1967).] That being so, the limit on rate of
phonetic perception caused by the temporal resolving power of the ear is no
longer set by the number of phonetic segments transmitted per unit time, but
by the considerably smaller number of acoustic segments into which those
phonetic segments have been encoded. Just how much saving is effected in this
manner depends, of course, on the size of the encoding unit; and that will
surely vary according to the nature of the contiguous phones in the string,
rate of articulation, and other factors that we only dimly understand. But a
significant amount of encoding will almost always occur, most obviously within
the syllable, and it will, at every rate of articulation, effectively reduce
the number of discrete acoustic segments that must be perceived.

²In the case of the consonants, place of production refers to where in the
mouth (lips, alveolar ridge, or velum, for example) the consonant constriction
is made; manner of production refers to an articulatory maneuver (velum
closed or open, tract totally closed or only partly closed, for example) that
is characteristic of phones with the same place of production; voicing
distinguishes classes of phones having the same place and manner according to
the state of the vocal cords (open or closed) at the beginning of the
gesture.

³A formant is a peak in the resonance curve of the vocal tract. The center
value of this peak, specified in Hz, is called the formant frequency.


Consider, now again, the difficulty the auditory system would have in
identifying the order of phonetic segments at even moderate rates of speech if
each phonetic segment were represented by an acoustic segment. But inasmuch
as the phonetic segments are not so represented, the problem of identifying
order of discrete acoustic segments does not arise (Day, 1970; Liberman et
al., 1972; Cole and Scott, 1974; Dorman, Cutting, and Raphael, 1975). Recall how
successive phonetic segments are encoded into the same stretch of sound, and
imagine, for example, simple cases like [ba] and [ab]. If these syllables are
produced at moderately rapid rates of articulation, it will be true of both
acoustic patterns that information about the consonant and the vowel is
carried simultaneously from the beginning of the sound to its end. But given
that the articulatory gestures have opposite directions in the two cases, from
closed (consonant) to open (vowel) for [ba] and from open (vowel) to closed
(consonant) in [ab], the acoustic shapes of the two acoustic syllables will be
different. Indeed, they will be mirror images: for [ba] the formants will be
rising  throughout;  for  [ab]  they  will  be  falling.  Thus,  information  about  the 
order  of  phonetic  segments  is  present  in  the  acoustic  signal,  not  as  discrete 
events  in  ordered  sequence,  but  as  variations  in  shape  or  form  (Liberman, 
1976). 
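
The point that order is carried by shape, not by a sequence of discrete events,
can be illustrated with a deliberately idealized Python sketch. It assumes a
single formant that moves linearly between two frequencies; the frequency
values, the frame count, and the function names are hypothetical and chosen only
for illustration. Real formant trajectories are neither linear nor exact
time-reversals of one another.

```python
def formant_track(start_hz, end_hz, n_frames):
    """Idealized formant trajectory: linear movement from start to end."""
    step = (end_hz - start_hz) / (n_frames - 1)
    return [start_hz + i * step for i in range(n_frames)]

def direction(track):
    """Classify the overall shape of a trajectory."""
    pairs = list(zip(track, track[1:]))
    if all(b > a for a, b in pairs):
        return "rising"
    if all(b < a for a, b in pairs):
        return "falling"
    return "mixed"

# Hypothetical values: closed (consonant) to open (vowel) for [ba].
ba = formant_track(900.0, 1200.0, 5)
# [ab] modeled as the mirror image in time: open (vowel) to closed (consonant).
ab = list(reversed(ba))

print("[ba]:", direction(ba))
print("[ab]:", direction(ab))
```

In this sketch both tracks contain exactly the same set of frequency values;
only the direction of change differs, and that direction is what distinguishes
the two orders.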


The  Key  to  the  Speech  Code 

Suppose  the  speech  code  were  entirely  arbitrary.  In  that  case,  a 
perceiving  device  could  only  match  the  signal  against  a dictionary  of  auditory 
templates,  just  as  if  it  were  using  a code  book.  Of  course,  the  templates 
could  not  correspond  to  segments  the  size  of  phones  but  would,  rather,  have  to 
be  at  least  as  large  as  the  coding  unit  that  encompasses  the  acoustic 
consequences  of  coarticulation.  As  we  remarked  earlier,  we  do  not  know 
exactly  how  large  that  unit  is  or  how  stable  it  might  be  in  the  face  of 
variations  in  speaking  rate,  word  and  phrasal  stress,  and  other  conditions  of 
articulation.  We  can  only  suppose  that,  at  the  smallest,  the  unit  would  have 
to  be  of  approximately  syllabic  size,  since  there  is  normally  so  much 
coarticulation  within  syllable  boundaries. 

But  the  speech  code  is  not  arbitrary;  there  is  a key  that  unlocks  it.  To 
see  the  nature  of  the  key,  and  how  it  makes  sense  of  the  relation  between 
message  and  signal,  we  need  only  remind  ourselves  that  the  peculiarities  of 
the  speech  code  are  just  those  that  are  introduced  by  the  speaker  as  he  lends 
himself to the processes by which the message is encoded in the sound. When
those processes are understood, their consequences can hardly appear arbitrary.
Thus, the key to the code is in the manner of its production. We should
remark parenthetically that in this respect, speech is like the rest of
language and different from most other processes: all the complications of
language that the hearer must cope with are only those that, as speaker, he
'knows' how to introduce; the complications of nonlinguistic perception, on
the  other  hand,  are  typically  not  owing  to  the  hearer  (or  viewer)  but  are, 
rather,  external  to  him.  At  all  events,  the  processes  by  which  speech  is 
produced  make  it  possible  to  understand  the  relation  between  acoustic  signal 
and  phonetic  message,  however  peculiar  that  relation  might  be. 

Although  knowing  how  speech  is  produced  enables  us  to  see  why  the 
complications  of  the  code  should  be  peculiar  in  the  way  they  are,  it  does  not 
provide  an  automatic  decoding  procedure.  Thus,  we  now  understand  enough  about 
the  speech  code  to  be  able  to  synthesize  speech  by  rule  (Ingemann,  1957; 
Liberman,  Ingemann,  Lisker,  Delattre,  and  Cooper,  1959;  Kelly  and  Gerstman, 
1961;  Kelly  and  Lochbaum,  1962;  Cooper,  1962;  for  a summary,  see  Mattingly, 
1974).  That  is,  we  can  build  a mechanism  that  accepts  as  input  a string  of 
phonetic  symbols  and  then,  as  output,  delivers  speech.  Using  rules  for  the 
conversion  that  can  be  either  acoustic  or  articulatory,  the  synthesizer 
produces  speech  that  is  imperfect — reflecting  our  imperfect  command  of  the 
code — but  rather  highly  intelligible,  nevertheless,  and  reasonably  acceptable. 
Now  if  we  could  simply  turn  those  rules  around,  we  should  have  a working  model 
for  speech  perception.  Unfortunately,  the  rules  for  synthesis,  like  all 
grammatical  rules,  work  in  only  one  direction,  downhill;  they  take  us  from 
message  to  signal  but  not  the  other  way.  Perhaps  there  are  rules  that  go  in 
either  direction,  but  they  have  not  yet  been  found.  Thus,  to  suggest  that  a 
listener  might  use  the  rules  as  a key,  is  only  to  imply  some  kind  of 
connection  between  perception  and  production,  of  which  more  later;  the 
underlying  mechanism  is,  at  present,  unknown. 


THE  SPECIAL  PROCESSES  OF  PHONETIC  PERCEPTION 

Surely  the  most  parsimonious  way  to  account  for  phonetic  perception  is  to 
invoke  only  those  mechanisms  that  are  more  or  less  common  to  mammalian  (or 
primate) auditory systems. (See Miller, in press.) Can we suppose, then,
that  such  processes  are  sufficient,  or  must  we  look  to  specializations  of 
various  kinds?  If  specializations  do  exist,  are  they  in  the  form  of  auditory 
devices  that  are  tuned  to  respond  to  the  phonetically  relevant  parts  of  the 
speech  signal?  Or  are  they  more  accurately  characterized  as  integral  parts  of 
a system,  more  linguistic  than  auditory,  that  is  specialized  to  deal  with  the 
peculiarities  of  grammatical  codes?  In  this  section  we  will  consider  whether 
both  such  specializations  might  exist,  the  one  to  deal  with  the  purely 
acoustic characteristics of the perceptually important parts of the signal,
the  other  to  cope  with  the  grammatical  code  that  relates  the  signal  to  the 
phonetic  information  it  conveys. 

Auditory Specializations for Extracting the Phonetically Relevant Information
from the Speech Signal

Many  important  attributes  of  the  speech  signal,  including  some  that  carry 
a heavy load of phonetic information, are not physically salient. For
example, although most of the linguistically important information is contained
in the lowest three formants, the acoustic energy is not tightly
concentrated  there  but  is,  rather,  smeared  diffusely  over  the  entire  speech 
spectrum.  Or  again,  despite  the  fact  that  consonants  carry  a far  heavier  load 
of  segmental  phonetic  information  than  do  vowels,  they  are  signaled  by  far 
less  acoustically  prominent  portions  of  the  spoken  syllable.  Thus,  formant 
frequency shifts (transitions) that carry important, even essential, information
about consonantal place of articulation often make excursions of hundreds
of cycles in some 30 or 40 msec. Since humans seem to have no difficulty in
extracting  that  information,  one  is  led  to  wonder  whether  there  may  not  be 
devices  in  the  auditory  system  specialized  for  that  purpose.  These  devices 
would  be  analogous,  perhaps,  to  the  feature  detectors  found  in  other  species. 

We have in mind the example of the cat, in which Whitfield and Evans
(1965) found single cells ("miaow" cells) responsive to the rate and direction
of  frequency  change.  Whitfield  (1965)  pointed  out  the  possible  relevance  of 
this  finding  to  the  perception  of  formant  transitions  in  speech  when  he 
suggested that such units might be "...a final link in the mechanism...by
which  speech-like  and  similar  signals  are  processed"  (p.  247).  If  Whitfield 
is  correct,  we  would  have,  not  an  auditory  specialization  for  language,  but 
rather  a general  auditory  device  (perhaps  typical  of  mammals)  that  is 
exploited  by  humans  for  linguistic  purposes. 


In  fact,  an  auditory  mechanism  specialized  for  language  may  be  difficult 
to demonstrate, since we obviously cannot apply to humans the electrophysiological
techniques that have been used on animals. It may, however, be
possible  to  approach  the  matter  indirectly,  as,  for  example,  by  extending  to 
speech  the  adaptation  procedures  originally  developed  in  studies  of  vision. 
The first to do this were Eimas and Corbit (1973). With synthetic syllables
(for  example,  [ba]  vs.  [pa])  that  ranged  along  an  acoustic  continuum,  these 
investigators  used  the  techniques  of  adaptation  to  produce  shifts  in  the 
position  of  the  perceptual  boundary.  The  results  led  them  to  speculate  that 
their  procedures  had  affected  a pair  of  binary  phonetic  feature  detectors,  and 
that  adaptation  or  fatigue  of  one  detector  functionally  sensitized  its 
opponent.  Subsequent  work  (see  Cooper,  1975,  and  Ades,  1976,  for  reviews) 
demonstrated  analogous  effects  for  other  consonantal  feature  oppositions.  If 
these  effects  were  truly  on  phonetic  features,  they  would  only  provide 
additional  evidence  for  the  'reality'  of  such  entities  and  offer  still  another 
method,  though  potentially  a most  useful  one,  for  defining  their  boundaries. 


More  relevant  to  our  concerns  here,  therefore,  are  adaptation  studies 
like  the  one  by  Bailey  (1973),  which  showed  that  the  effect  decreased  with  a 
decrease  in  spectral  overlap  between  adapting  and  test  syllables.  This 
suggests  that  if  feature  analyzing  systems  were  indeed  being  isolated,  the 
features  were  auditory  rather  than  phonetic.  (For  a relevant  discussion,  see 
Ades,  in  press.)  The  finding  by  Bailey  assumes  considerable  importance  from 
our  point  of  view,  because  there  is  apparently  no  other  kind  of  evidence  for 
the existence of feature analyzing systems of an auditory sort.
Unfortunately, the matter appears not to be that simple. Further investigation
has shown that the degree of adaptation is contingent on so many other
aspects  of  the  synthetic  continuum,  including  intensity  (Ganong,  1975)  and 
fundamental  frequency  (Ades,  in  press),  that  one  may,  in  the  end,  be  led  to 
doubt  the  feature  interpretation  altogether.  Perhaps,  then,  the  achievement 
of  the  work  on  selective  adaptation  will  have  been  to  demonstrate  the 
operation  of  distinct  perceptual  channels  rather  than  the  existence  of  feature 
detectors  as  such.  Nevertheless,  the  investigators  may  have  found  a method 
for  exposing  processes  that  respond  to  linguistically  significant  parts  of  the 
speech signal, and thus have made possible the discovery of auditory
specializations  for  language. 

Linguistic Specialization for Recovering the Phonetic Message

Even  if  auditory  detectors  of  the  kind  just  discussed  do  exist,  they 
could  do  no  more  than  extract  from  the  acoustic  signal  those  features  that  are 
phonetically  relevant.  They  might  thus  solve  problems  created  by  the  fact 
that  speech  is,  in  certain  respects,  a poor  signal,  but  it  would  presumably 
remain  to  some  other  device,  more  phonetic  than  auditory,  to  deal  with  the 
different  fact  that  speech  is  a special  code.  As  we  were  at  pains  to  point 
out  earlier,  the  peculiar  characteristics  of  the  code  arise  from  the  way 
speech  is  produced,  in  particular,  from  coarticulation.  We  should  suppose, 
then, that the distinguishing characteristic of the phonetic device would be
that it somehow makes use of that circumstance (Cooper et al., 1952; Liberman,
Delattre  and  Cooper,  1952).  For  the  present,  the  emphasis  should  be  on  the 
word  "somehow";  we  do  not  wish  to  speculate  about  the  underlying  mechanism,  if 
only  because  we  cannot  offer  relevant  data.  But  if  there  is  a device  that 
behaves,  by  whatever  means,  as  if  it  'understood'  how  speech  is  produced,  then 
we should expect to find evidence for a link between perception and production.
Indeed, it would be just such a linkage that would clearly characterize
phonetic as against auditory perception (Studdert-Kennedy, 1976; Liberman and
Pisoni, in press).

In  the  sections  that  follow,  we  will  identify  several  kinds  of  support 
for the assumption that there is a phonetic perceiving device and, correspondingly,
a phonetic mode of perception. Some of that support is indirect in
that  it  depends  on  our  inability  to  account  for  certain  phenomena  of  speech 
perception  in  terms  of  what  we  now  know  of  how  the  ear  works  and  what  it 
commonly  does;  but  some  is  more  direct,  being  based  on  putative  differences 
between  auditory  and  phonetic  perception  and,  in  some  cases,  on  the  apparent 
links  to  production  that  characterize  the  phonetic  mode. 

Coping with the segmentation. If there were an acoustic criterion that
could directly divide the speech stream into segments corresponding in size to
the phones, then we should see no need to invoke other-than-auditory
processes.  No  matter  how  complex  in  structure  the  acoustic  segments  might  be, 
we  should  suppose  that  correspondingly  complex  auditory  processes  would  be 
equal  to  the  job.  As  we  have  seen,  however,  one  of  the  characteristics  of  the 
speech  code  is  that  the  phonetic  information  is  distributed  in  curious  ways 
through  the  sound.  This  is  the  most  striking  disparity  between  acoustic 
signal  and  phonetic  message  and,  from  the  standpoint  of  a perceiving  device, 
the most troublesome. Indeed, the disparity is greater than our characterization
of the speech code might have implied, since the sound segments do not
map onto the phones either in the way they divide or in the way they group.
Thus, rapid switches in sound source during the articulation of successive
phones may spread the information about a single message segment through
several  acoustic  segments  (Fant,  1962,  1968),  as  when  stop-consonant  closure 
and  release  into  a following  vowel  yield  a brief  silence,  an  explosive 
release,  a period  of  aspirated  noise,  and  a more-or-less  abrupt  voice  onset. 
On  the  other  hand,  coarticulation  may,  as  we  have  previously  noted,  cause  the 
information  about  several  message  segments  to  be  collapsed  into  a single 
segment  of  sound. 

The  severity  of  the  problem  is  evidenced  by  the  fact  that  it  has  resisted 
solution  for  many  years,  as  much  by  those  concerned  with  speech  synthesis 
(Coker et al., 1973) as by those working on automatic speech recognition
(Ainsworth,  1976).  Both  groups  have  been  driven  to  acknowledge  that  segments 
the  size  of  phones  are  not  to  be  found  as  segments  in  the  acoustic  stream;  the 
irreducible  acoustic  unit  is  of  approximately  syllabic  dimensions,  just  as  we 
would  expect  given  the  very  earliest  result  of  research  with  synthetic  speech 
(Cooper  et  al.,  1952;  Liberman  et  al.,  1952).  In  the  first  attempt  to 
'synthesize'  speech  by  commuting  (and  concatenating)  segments  of  sound  excised 
from  prerecorded  utterances,  Harris  (1953)  found  that  the  'building  blocks' 
had  to  be  larger  than  phones.  Other  investigators  (Peterson,  Wang,  and 
Sivertsen, 1958) later reported some success in producing speech by concatenating
prerecorded segments, but the segments they required were a numerous and
varied assortment of syllables and 'phoneme dyads'. Significant improvements
in this method of synthesis have recently been made by Fujimura (1975, in
press), though again the unit must be larger than the phone. And now, even in
synthesis  by  rule,  Mattingly  (1976)  has  found  it  advantageous  to  preorganize 
the  phonetic  segments  into  syllables  and  then  use  those  larger  units  as  input 
to  his  synthesis  program.  As  for  the  work  on  automatic  speech  recognition,  it 
has long been plain that segmentation into phones by a straightforward
acoustic criterion is hardly possible (Hyde, 1972), though segmentation into
syllables can be done reasonably well (Mermelstein, 1975).

The  foregoing  considerations  and  facts  suggest  that  the  phones  are  not 
directly  given  in  perception  but  must  rather  be  derived  from  a running 
analysis  of  the  signal  over  stretches  of  at  least  syllable  length.  There  is 
ample  experimental  evidence  that  this  is  so. 

Consider,  for  example,  the  matter  of  segment  duration  and  its  role  in  the 
perception  of  phones.  It  is  known  that,  in  English,  the  contrast  between 
unreleased  voiced  and  voiceless  stops  in  syllable-final  position  (for  example, 
[ab] vs. [ap]) can be determined by the duration of the preceding vowel
(Denes,  1955;  Raphael,  1972).  But  what  happens,  then,  if  that  vowel  is  itself 
preceded  by  the  consonant-vowel  transitions  appropriate  to,  say,  [b],  as  in 
[bab] vs. [bap]? Does the listener pay attention only to the duration of the
preceding  vowel?  Presumably  he  cannot  do  that  if,  as  we  have  suggested,  the 
transition  cues  for  the  consonant  simultaneously  carry  information  about  the 
vowel.  And,  indeed,  he  does  not.  According  to  recent  experiments  (Raphael, 
Dorman,  and  Liberman,  1975),  the  duration  used  by  the  listener  to  determine 
voicing  in  the  final  segment  includes  all,  or  almost  all,  of  the  transition 
cues for the consonant in the initial segment.

Given  only  that  result,  we  might  suppose  nevertheless  that  the  listener 
takes  one  part  of  the  acoustic  signal  as  consonant  and  another  part  as  vowel, 
provided  we  further  suppose  that  the  voicing  of  a syllable-final  stop  is 
determined  by  the  sum  of  the  durations  of  consonant  plus  vowel.  At  least  two 
other  experiments  suggest,  however,  that  the  listener  does  not  compute 
consonant  and  vowel  durations  on  different  parts  of  the  syllable.  In  one  of 
these experiments,⁴ listeners were asked to adjust the duration of a steady-state
vowel to match the duration of the medial vowel in a stop-vowel-stop
syllable  whose  formants  had  parabolic  trajectories.  As  determined  by  that 
simple  and  direct  technique,  the  perceived  duration  of  the  medial  vowel  was 
found  to  include  a significant  portion  of  the  consonant-vowel  transitions. 

The  other  experiment  dealt  with  duration  as  a cue  for  the  perceived 
identity  of  a medial  vowel  and,  simultaneously,  with  the  voicing  of  a final 
stop, for example, [bet], [bæt], [bed], [bæd].⁵ The results clearly imply that
the listener did not assign one part of the syllable duration to the vowel and
another part to the consonant. Rather, it was as if he used the whole
duration of the syllable, but used it twice: once to determine the identity
of the vowel and again to determine whether the syllable-final stop was voiced
or  voiceless. 

That  the  information  about  the  phonetic  segments  is  smeared  through  the 
syllable  is  indicated  also  by  evidence  that  the  flanking  transitions  in  a CVC 
syllable  are  used  to  judge  the  identity  of  the  medial  vowel.  For  example, 
Ochiai  and  Fujimura  (1971)  recorded  natural,  but  distinctly  articulated  words 
and  observed  no  errors  of  vowel  identification.  However,  when  they  presented 
50  msec  portions  gated  from  the  vowel  centers,  listeners'  judgments  frequently 
shifted  in  directions  that  could  be  explained  by  contextual  assimilation. 
Even  more  striking  are  the  results  of  Strange,  Verbrugge,  and  Shankweiler 
(1976).  They  recorded  nine  vowels  spoken  in  isolation,  and  the  same  nine 
vowels spoken in various CVC frames. Despite the increased acoustic complexity
introduced by a dynamic syllable structure, listeners correctly identified
the  vowels  significantly  more  often  when  they  were  presented  in  a consonantal 
frame,  even  a variable  one,  than  when  they  were  presented  in  isolation.  Thus, 
for  the  purpose  of  identifying  the  vowels,  the  perceiving  system  used  those 
parts  of  the  syllable  that  also  contained  information  about  the  consonants. 
That  is  yet  another  reflection  of  the  complex  relation  in  segmentation  between 
signal  and  message.  But  it  also  shows  that  though  the  perceptual  target  is  a 
vowel,  for  which  static  formant  frequencies  are  often  assumed  (Peterson  and 
Barney, 1952), the perceptual system nevertheless prefers the dynamic configuration
of a syllable, perhaps because it can then take advantage of the many
constraints inherent in the way the vocal apparatus works when it coarticulates.

Considering all that is known about the peculiar disparity in segmentation
between perceived message and transmitted signal, we suppose that the
appropriately segmented percept lies at some remove from the immediately given
auditory pattern, and that it is recovered by processes different from those
the auditory system is ordinarily called on to provide. As for the possibility
that such special processes make reference to production, we can offer no
direct  evidence  about  segmentation  as  such,  only  the  observation  that  to  find 
the  segments,  it  must  help  to  understand  where  they  were  lost. 

Phonetic interpretation of the sounds of speech. We should now look more
directly  at  some  phenomena  of  speech  perception  that  depend,  presumably,  on 
the  same  decoding  processes  that  perform  the  segmentation  but  pertain  more 
closely  to  what  those  segments,  once  retrieved,  sound  like  to  a listener.  Do 
they  sound  like  other  sounds  or  do  they  not?  And  when  not,  is  there  evidence 
of  a link  to  production? 

Impressions  of  the  difference  between  auditory  and  phonetic  modes. 
To  convey  a feeling  for  what  we  mean  by  the  suggestion  that  the  sound  of 
speech  is  different  from  the  sound  of  nonspeech,  it  may  be  useful  to  describe 
several  phenomena  that  are  part  of  the  experience  of  people  who  work  with 
synthetic  speech.  One  of  these  is  reflected  in  an  observation  made  by 
investigators  who  used  the  Pattern  Playback,  an  early  research  synthesizer 
that  converted  hand-painted  spectrograms  and  other  designs  into  sound  (Cooper, 
1950;  Cooper,  Liberman,  and  Borst,  1951;  Cooper,  1953).  Having  succeeded  in 
constructing  highly  schematized  spectrograms,  like  the  one  at  the  top  of 
Figure  1,  that  nevertheless  produced  intelligible  speech,  the  investigators 
thought  to  take  advantage  of  the  flexibility  of  the  Playback  in  order  to 
destroy  the  intelligibility  of  speech  by  a novel  and,  they  assumed,  uniquely 
effective  procedure:  instead  of  drowning  the  speech  in  noise,  which  was  the 
usual  way,  they  would  'mislead'  the  ear.  To  that  end  they  added  to  the 
spectrogram  'false'  formants,  always  continuous  with  the  'true'  formants,  that 
improperly  connected  and  extended  the  proper  components  of  the  acoustic 
pattern.  An  example  is  shown  in  the  middle  and  at  the  bottom  of  Figure  1.  In 
fact,  as  the  reader  can  see,  the  eye  is  misled.  But  the  ear  was  not.  When 
the  altered  pattern  was  converted  to  sound,  the  listener  heard  the  original 
phonetic  message  against  a loud  background  of  variously  pitched  whistles.  It 
was  as  if  the  perceptual  machinery  had  separated  the  acoustic  effects  that  a 
vocal  tract  can  produce  from  those  it  cannot.  At  all  events,  the  effect  was 
of  two  qualitatively  different  kinds  of  perception — articulate,  monotone 
speech  in  the  one  case,  complex  and  very  bad  'music'  in  the  other. 

Much  the  same  kind  of  phenomenon,  though  on  a smaller  scale,  can  be 
produced,  not  only  on  a device  like  the  Pattern  Playback,  but  also  on  the  more 
modern  parallel-resonance  synthesizers  now  in  common  use.  An  example  is  seen 
in  the  contrast  between  the  initial  stop  consonants  of  the  syllables  [ba]  and 
[ga]. As shown in Figure 2, a sufficient acoustic cue is the direction of the
second-formant transitions, rising for [b] and falling for [g]. Now, given
our  knowledge  of  psychoacoustics,  we  should  suppose  that  those  cues  would 
sound  like  rising  and  falling  glissandos  or  like  chirps  of  different  pitch, 
depending  on  how  rapidly  the  formants  moved  on  the  frequency  scale.  And,  in 
fact,  when  we  present  the  formant-transition  cues  by  themselves,  as  shown  in 
the inset of Figure 2, that is exactly how they do sound (Mattingly, Liberman,
Syrdal, and Halwes, 1971; Shattuck and Klatt, 1976). But what do we say,
then, about the fact that those same transitions are heard in the context of
speech as the abstract linguistic events we can only describe as [b] and [g]?
Of course, the transition cues are isolated in the one case but part of a
larger, if otherwise constant, pattern in the other, so that we might
attribute the difference in perception to some kind of auditory interaction.

Figure 1: At the top, a hand-drawn spectrogram appropriate for synthesis of a
sentence; in the middle, a pattern intended to 'mask' the sentence
by 'misleading' the ear; and, at the bottom, the composite of
sentence and 'mask,' which produces, again, the perception of the
sentence, plus a dissociated set of whistles and noises.
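
The idea of presenting a transition cue by itself can be illustrated with a
minimal sketch. It assumes that an isolated transition may be approximated by
a single sine tone whose frequency moves linearly over a short interval; the
function name formant_glide, the sample rate, and the frequency and duration
values are hypothetical placeholders and are not taken from the studies cited.

```python
import math


def formant_glide(f_start_hz, f_end_hz, duration_s, sample_rate=16000):
    """Return samples of a sine tone gliding linearly in frequency.

    This is only a stand-in for an isolated formant transition; real
    synthesizers shape a resonance rather than sweep a pure tone.
    """
    n = int(round(duration_s * sample_rate))
    samples = []
    phase = 0.0
    for i in range(n):
        # Linear interpolation of the instantaneous frequency.
        frac = i / (n - 1) if n > 1 else 0.0
        freq = f_start_hz + (f_end_hz - f_start_hz) * frac
        samples.append(math.sin(phase))
        # Accumulating phase keeps the waveform continuous as frequency moves.
        phase += 2.0 * math.pi * freq / sample_rate
    return samples


# Hypothetical values: a rising and a falling glide, 50 msec each.
rising = formant_glide(900.0, 1200.0, 0.05)
falling = formant_glide(1800.0, 1200.0, 0.05)
```

Written to a sound file and played alone, glides of this kind would be
expected to sound like brief chirps rather than like consonants, which is the
contrast the text describes.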

But  even  when  the  transition  cues  are  in  exactly  the  same  acoustic 
context  it  is  possible  to  hear  them,  simultaneously,  as  phonetic  stops  and 
auditory  chirps.  That  effect  was  created  by  Rand  (1974)  in  the  following  way. 
Into  one  ear  he  put  all  of  the  first  formant  and  the  steady-state  parts  of  the 
second  and  third  formants,  while  into  the  other  ear  he  put  just  the  transition 
cues  (of  the  second  and  third  formants)  that  distinguish  [ba]  and  [ga],  being 
careful  to  synchronize  them  properly  with  respect  to  the  rest  of  the  pattern. 
Though  there  is  but  one  context — and  indeed  one  brain — the  formant  transitions 
will,  in  this  situation,  often  simultaneously  produce  two  very  different 
perceptions:  the  syllable  [ba]  (or  [ga])  and  a rising  (or  falling)  chirp. 
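
The split of the pattern between the two ears can be sketched as follows. This
is an illustration under stated assumptions, not a description of the
apparatus Rand used: the function name dichotic_pair, the representation of
signals as lists of samples, and the onset_index parameter are all
hypothetical.

```python
def dichotic_pair(base, transition, onset_index):
    """Pair a 'base' signal (one ear) with isolated transitions (other ear).

    Returns a list of (left, right) sample tuples. The transition is padded
    with silence so that it begins at onset_index, which keeps it
    synchronized with the rest of the pattern.
    """
    if onset_index < 0:
        raise ValueError("onset_index must be non-negative")
    length = max(len(base), onset_index + len(transition))
    # Left channel: first formant plus steady-state portions (the base).
    left = list(base) + [0.0] * (length - len(base))
    # Right channel: silence everywhere except the transition interval.
    right = [0.0] * onset_index + list(transition)
    right += [0.0] * (length - len(right))
    return list(zip(left, right))


# Placeholder signals standing in for the synthesized components.
stereo = dichotic_pair([1.0] * 5, [0.5, 0.5], 1)
```

Keeping both channels the same length, with explicit silence around the
transition, is what preserves the timing relation between the two ears.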

Essentially  the  same  kind  of  effect  has  been  created,  though  successively 
now  instead  of  simultaneously,  as  part  of  an  experiment  designed  by  Bailey, 
Dorman, and Summerfield (1977) to permit a comparison of speech and nonspeech
perception.  The  stimulus  patterns  are  similar  to  those  commonly  used  in 
research  with  synthetic  speech  in  that  they  contain  transitions  appropriate  to 
several  stop  consonant-vowel  combinations,  followed  by  vowel  steady  states; 
they  differ  from  those  normally  used  in  that  the  formants  are  replaced  by  pure 
tones,  one  for  each  formant  and  set  to  its  center  of  energy.  On  first  being 
presented  with  such  patterns,  listeners  hear  them  as  a complex  of  tones,  but 
after  some  time  they  begin  to  hear  them  as  speech.  We  will  not  here  presume 
to  report  on  the  results  of  the  experimental  comparisons  that  the  study  was 
designed  to  permit;  we  only  remark  the  phenomenon,  which  is  that  there  is  a 
striking difference in subjective impression, depending on whether the
listener is perceiving the stimulus patterns as tones or as speech; thus, it
offers  yet  another  way  to  gain  a general  appreciation  of  the  perceptual 
differences  between  speech  and  nonspeech. 

At  all  events,  it  is  just  such  qualitative  contrasts  in  perception  as  we 
have  described  here  that  can  convey  to  a listener  a direct  impression  of  what 
we  mean  by  the  distinction  between  auditory  and  phonetic  modes.  We  turn  now 
to  some  relevant  experimental  observations. 

Acoustic cues as a source of information about the speaker's vocal tract.
Those aspects of the speech signal that, when varied, cause phonetically
significant changes in perception are known as "acoustic cues." It is to
those  cues  that  we  should  now  look,  because  we  find  there  the  clearest 
evidence  for  the  link  between  perception  and  production  that  characterizes 
perception  in  the  phonetic  mode.  No  single  piece  of  evidence  is,  by  itself, 
wholly  convincing;  it  is  only  the  pattern  that  tells.  For  when  we  view  the 
data  in  the  light  of  known  or  imaginable  auditory  processes,  we  see  a number 
of  unconnected  facts  that  require,  apparently,  an  equal  number  of  ad  hoc 
assumptions.  If  we  apply  Occam's  razor,  however,  we  find  a more  or  less 
comfortable  fit  to  the  single  assumption  that  underlies  this  chapter:  that 
the acoustic cues are processed, not only in the auditory system, but also at
some more abstract, phonetic remove; there, an appropriately specialized
device uses the articulatory information provided by those cues to shape the
listener's perception of what the speaker said.

Figure 2: Spectrographic patterns sufficient for synthesis of [ba] and [ga].
Inset: The second-formant transitions that cue the perceived difference
between the syllables, but sound, in isolation, like chirps.

A simple  example.  To  see  how  an  acoustic  cue — silence — might 
provide  information  about  a phonetically  important  gesture,  we  should  consider 
the  following  facts  about  fricatives  and  stop  consonants.  A speaker  cannot 
produce  a stop  consonant  without  closing  his  vocal  tract  for  a brief  period, 
and  he  cannot  close  his  vocal  tract  without  producing  a period  of  silence. 
Hence,  silence  might  be  important  to  the  perception  of  stop  consonants, 
especially  if  the  perceptual  processes  'know'  that  stops  require  closure  and 
that  closure  results  in  silence.  It  is  relevant,  then,  to  discover  that  in 
the  perception  of  stops,  silence  is,  in  fact,  an  important  condition. 

Suppose,  for  example,  we  record  the  fricative-vowel  syllable  [sa].  As 
shown  schematically  in  Figure  3,  the  acoustic  pattern  consists  of  a patch  of 
noise,  associated  with  the  fricative,  followed  by  a vocalic  section.  The 
vocalic  section  begins  with  the  formant  transitions  characteristic  of  the 
fricative  [s]  when  coarticulated  with  the  vowel  [a];  there  follow,  then,  the 
steady-state  formants  characteristic  of  the  (drawn-out)  vowel  [a].  It  should 
be  noted  about  the  formant  transitions  at  the  beginning  of  the  vocalic  section 
that they are also appropriate, at least approximately, for the stop consonants
[t] and [d], which have the same place of production as [s]. Now if we
remove  the  patch  of  noise,  listeners  will  commonly  hear  [ta],  not  [a] — that 
is,  they  will  hear  a stop  consonant  where  none  was  before.  If  we  now  replace 
the  s-noise  in  such  a way  as  to  create  a silence  of  50  or  so  msec  between  it 
and the vocalic portion, listeners will again hear the stop, this time in
[sta].  We  should  say,  parenthetically,  that  the  same  kind  of  effect  can  be 
obtained  starting  with  a stop-vowel  syllable  like  [ta].  In  that  case,  putting 
s-noise  immediately  in  front  of  the  syllable  will  cause  the  listener  to  hear 
[sa],  not  [sta];  if  the  listener  is  to  hear  [sta],  we  must  create  a short 
period  of  silence  between  the  s-noise  and  the  vocalic  section. 
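
The manipulation just described amounts to splicing a variable interval of
silence between two fixed pieces of signal. A minimal sketch follows, assuming
signals held as lists of samples; the function names, the 16000-Hz sample
rate, the placeholder signals, and the 0 to 100 msec range in 10-msec steps
are assumptions made for illustration (the text specifies only "50 or so
msec").

```python
def insert_silence(noise, vocalic, gap_ms, sample_rate=16000):
    """Return noise + silent gap + vocalic section as one list of samples."""
    if gap_ms < 0:
        raise ValueError("gap_ms must be non-negative")
    gap = [0.0] * int(round(gap_ms * sample_rate / 1000.0))
    return list(noise) + gap + list(vocalic)


def gap_continuum(noise, vocalic, gap_values_ms, sample_rate=16000):
    """Build one stimulus per gap duration, keyed by the gap in msec."""
    return {g: insert_silence(noise, vocalic, g, sample_rate)
            for g in gap_values_ms}


# Placeholder signals: real stimuli would be recorded or synthesized speech.
noise = [0.1] * 1600    # 100 msec standing in for the s-noise
vocalic = [0.5] * 3200  # 200 msec standing in for the vocalic section
stimuli = gap_continuum(noise, vocalic, range(0, 101, 10))
```

Because only the gap varies from one stimulus to the next, any change in what
listeners report across such a set can be attributed to the silent interval.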

We see in this example that silence has just the sound we should expect
it  to  have,  given  the  assumption  that  it  tells  the  listener  whether  or  not  the 
speaker  closed  his  vocal  tract  long  enough  to  have  produced  a stop  consonant. 
But, surely, there might be other, perhaps more parsimonious, assumptions. We
note  in  this  connection  that  our  examples  conform  to  the  paradigm  for  auditory 
forward  masking,  so  we  should  take  account  of  the  possibility  that  the 
transition  cues  are  simply  being  masked  when  the  noise  is  too  close  to  them; 
or  we  suppose,  more  vaguely,  that  there  is  some  (not  previously  discovered) 
auditory  interaction  between  silence  and  the  transition  cues  which  causes  us 
to  hear  the  peculiar  sound  of  a stop  consonant. 

But  there  is  considerable  evidence  that  such  alternative  assumptions  will 
not hold. Note, first, that in fricative-vowel syllables like the [sa] of our
example, it has been found that the formant transitions contribute
significantly to the perception of the fricative (Harris, 1958; Darwin, 1971). We
should  suppose,  therefore,  that  the  transition  cues  are  'getting  through' — 
that  is,  they  are  not  being  masked  by  the  s-noise.  It  is  only  their 
(phonetic)  interpretation  as  fricative  (when  the  silence  is  relatively  short) 
or stop (when the silence is relatively long) that is affected.

More evidence of the same kind comes from a study of selective adaptation
by Ganong (1975). There, the first step was to measure the shift in the
(perceived) boundary between [b] and [d] caused by adaptation with the
syllable [de]. Then, a patch of s-noise was placed in front of the [de] so
that it sounded, as in our example, like [se]. When that syllable ([se]) was
used as the adapter, the effect on the [b-d] boundary was found to be just as
great as it had been with [de]. From that it follows not only that the
transition cues were getting through (that is, that they were not being
blocked by the noise when they were perceived as [se] rather than as [de]) but
that they were getting through in full strength.

A third  kind  of  evidence  comes  from  a comparison  of  how  the  transition 
cues  are  perceived  when,  in  an  acoustic  context  otherwise  like  that  of  our 
example,  they  are  in  or  out  of  a proper  syllable  (Dorman,  Raphael,  Liberman, 
and  Repp,  1975).  The  syllable  consisted  of  a patch  of  s-noise  followed  by  a 
vocalic  portion  that  was  either  [pe]  or  [ke].  With  the  noise  up  close, 
listeners reported hearing [se], not [spe] or [ske]; [spe] and [ske] were
perceived  only  when  there  was  an  appropriate  interval  of  silence  between  the 
noise  and  the  rest  of  the  syllable.  In  the  other  (nonsyllable,  nonspeech) 
condition,  the  transition  cues  were  isolated  from  the  rest  of  the  vocalic 
section, in which circumstance they sounded like 'chirps' of different pitch
and could easily be identified on that basis; then they were placed, as in the
speech  patterns,  after  the  patch  of  s-noise.  In  that  condition — that  is,  when 
heard  as  'chirps' — the  transition  cues  were  correctly  identified  even  when 
there  was  no  silent  interval  separating  them  from  the  noise.  Thus,  they  were 
not  significantly  masked  by  the  noise,  but,  just  as  important  from  our  point 
of  view,  their  perception  was  not  changed  in  any  qualitative  way — that  is, 
there  was  no  apparent  interaction  among  noise,  silence,  and  transitions. 

Much  the  same  kind  of  result  has  been  obtained  with  stops  in  syllable- 
final  position  (Dorman,  et  al.,  1975).  First,  it  was  established  that  in  the 
disyllables [beb de] and [beg de] listeners could correctly perceive the
syllable-final stops [b] and [g] only if there was a sufficient period of
silence (approximately 60 msec) between the syllables. Then, the
second-formant transitions that were the only acoustic difference between the
[b] and the [g] were isolated from the rest of the pattern of the first
syllable, in which circumstance they were heard as two quite different chirps,
and presented, as in the first condition, before the syllable [de]. Listeners
correctly identified the chirps most of the time, even when there was no
silence at all between them and [de]; the amount of masking was relatively
slight,  nothing  at  all  like  the  total  effect  that  had  occurred  in  the  case  of 
the  speech  sounds,  and  there  appeared,  again,  to  be  no  interaction-caused 
change  in  the  phenomenal  'quality'. 

So  much,  then,  for  the  possibility  that  silence  is  a necessary  condition 
for  perception  of  stops  because  it  prevents  masking  of  the  transitions  or 
because  it  collaborates  in  some  auditory  interaction  with  them.  We  turn  now 
to the fact that in the absence of transitions and other stop-consonant cues,
silence can be a more nearly sufficient condition for perception of a stop.

Suppose we insert the appropriate amount of silence between the noise of
a fricative and a vocalic section so structured that no stop is heard when it
is presented by itself. Begin, for example, with the syllable [lit], then put
a patch of s-noise in front of it. In that case, the resulting syllable is
perceived as [slit] if there is no silence between the noise and the vocalic
section, but as [split] if the silence is increased sufficiently (Dorman,
Raphael, and Liberman, 1976; Erickson, Fitch, Halwes, and Liberman, 1977).
For  a simpler  example,  consider  that  an  appropriate  amount  of  silence  inserted 
between  a patch  of  s-noise  and  the  vowel  [i]  will  produce  [ski];  a similar 
arrangement  with  [u]  will  produce  [spu]  (Summerfield  and  Bailey,  1977). 
Notice,  too,  in  these  last  cases  that  silence  is  not  only  a sufficient  cue  for 
stop  consonant  manner  but  that  the  'place'  of  the  perceived  stop  (whether  [k] 
or  [p])  is  different,  of  which  more  later. 


Silence has also been shown to be a sufficient condition for distinguishing
fricative from affricate both in syllable-initial and syllable-final
positions.  Thus,  one  can  record  the  word  'say'  and  the  word  'shop'  and  then 
convert  between  'say  shop'  and  'say  chop'  by  varying  the  interval  of  silence 
between  the  two  words  (Dorman,  et  al.,  1976).  Or  one  can  record  'dish'  and 
convert it to 'ditch' by introducing an appropriate amount of silence between
the  vocalic  part  of  the  syllable  and  the  fricative  noise  at  the  end. 6 

The  foregoing  considerations  all  imply  that  the  perception  of  silence  in 
our  simple  example  is  not  only  auditory  but  also  phonetic.  As  a phonetic 
percept,  it  conforms  to  a fact  about  the  speaker's  production — namely,  that  a 
stop  consonant  cannot  be  produced  without  closing  the  vocal  tract.  Of  course, 
such  conformity  could  occur  only  if  there  were  a phonetic  perceiving  device 
specialized  to  make  use  of  the  information  about  articulation,  and  if  there 
were, correspondingly, a phonetic mode of perception.

Equivalence in phonetic perception of different acoustic cues
produced by the same articulatory gesture. It is a commonplace of speech and
speech  perception  that  different  acoustic  cues  may  have  equivalent  effects  in 
phonetic  perception.  That  is  of  interest  because  the  cues  are  often  so 
different  acoustically  that  it  is  hard  to  conceive  how  they  might  be  related 
from  an  auditory  point  of  view.  The  relevant  facts  fall  into  several  classes; 
we  will  here  offer  samples  of  each. 

Perhaps  the  simplest  class  comprises  those  ubiquitous  cases  in  which 
there  are  multiple  (and  distributed)  acoustic  consequences  of  the  same 
articulatory  gesture.  Consider  again  the  example  of  the  preceding  section 
that  is  owed  to  Summerfield  and  Bailey:  an  appropriate  interval  of  silence 
between a patch of s-noise and the vowel [i] (or [u]) causes the listener to
hear [k] in [ski] (or [p] in [spu]). We now represent that fact schematically
in the top half of Figure 4. In the bottom half we represent the companion
fact, uncovered in the earlier research on the 'locus' of the stops, that a
rising transition at the beginning of the first formant of [i] (or [u]) will
also cause a listener to hear the stop [k] in [ki] (or [p] in [pu]) (Delattre,
Liberman, and Cooper, 1955). Now we note the perceptual equivalence of 60 or
so msec of silence, which is the cue in the top half of the figure, and the
rising frequency modulation at the beginning of the first formant, which is
the cue in the bottom half, and we ask what that amount of silence and that
kind of sound could possibly have in common. Nothing, we should think, when
we consider them from an auditory point of view, but in articulation they have
an obvious bond. To say [ski] (or [spu]), rather than [si] (or [su]), the
speaker  must  close  his  vocal  tract,  which  produces  the  silent  interval;  and 
then  he  must  open  it,  which  produces  the  rise  in  frequency  of  the  first 
formant.  Thus,  the  two  very  different  cues  are  the  distributed  acoustic 
results  of  an  essential  component  of  the  stop-consonant  gesture.  Given  that 
they  sound  alike — either  can  produce  the  perception  of  stop  consonant — we 
should  suppose  it  is  because  they  refer  to  the  same  articulation. 

For  this  same  example,  it  remains  to  take  account  of  the  fact  that  the 
perceived  stops  had  two  different  places  of  production,  velar  in  [ki]  (or 
[ski])  and  labial  in  [pu]  (or  [spu]).  We  note,  first,  that  energy  at 
frequency levels corresponding to the second-formant levels of [i] and [u] is
appropriate for closure of the vocal tract at the velar and labial places,
respectively. That helps us to understand why [i] becomes [ki] (or [ski]) and
[u] becomes [pu] (or [spu]) when sufficient cues for the stop manner are
added. But notice now a fact that is more relevant to our present purposes,
which  is  that  these  differences  in  perception  of  place  of  production  occur  in 
the  same  way  regardless  of  how  the  manner  dimension  was  signaled.  Thus,  our 
two  very  different  acoustic  cues — silence  and  sound — are  equivalent,  not  only 
in  their  ability  to  produce  the  perception  of  manner,  but  also  in  the  way  they 
combine  with  the  other  information  in  the  signal  to  produce  the  perception  of 
phonetic  place. 

Given  our  assumption  of  a link  between  production  and  perception,  and 
given  that  a linguistically  significant  gesture  almost  always  has  multiple 
acoustic consequences, we should expect to find many other instances of
phonetic-perceptual equivalence among cues that are very different in
acoustic-auditory terms. Just how many must depend on how finely we dissect the
acoustic signal into separate cues, and how often, in experiment, we play the
cues off against each other. Relevant studies have already made an impressive
record. It reaches back in time to an extension by Lisker (1957b) of an
earlier study (Lisker, 1957a) on the voicing distinction in poststress
position (as in 'rabid' vs. 'rapid'). Having determined in the earlier work
that duration of intersyllabic silence is an important voicing cue, Lisker
then found that specifiable amounts of that temporal cue could be traded for
specifiable settings of spectral cues (extent of appropriate transitions of
the first formant at the end of the first syllable and the beginning of the
second). Now, in a recent experiment on the distinction between
fricative-vowel and fricative-stop-vowel, Summerfield and Bailey (1977) have
established and precisely measured the equivalence of silence on the one hand,
and, on the other, such spectral cues as the frequency at which the first
formant starts and the extent of the first-formant transition.

There  is  also  evidence  of  equivalence  in  phonetic  perception  among 
different kinds of temporal cues. Referring again to Lisker's experiment, we
note his finding of an equivalence between duration of intersyllabic silence
and the duration of the first syllable of the word. In a recent experiment,7
referred to earlier, on the distinction between [dish] and [ditch] there is an
equivalence between the duration of silence separating the vocalic portion of
the syllable from the noise and the duration of the noise portion of the
fricative (or affricate). Also new is the discovery of a similar equivalence
between duration of silence and duration of noise in the contrast between
fricative-vowel and fricative-stop-vowel.8 In all these cases time is traded
for  time;  but  in  the  one  period  of  time  there  is  silence,  in  the  other  sound. 

In  the  spectral  domain,  too,  equivalences  among  different  cues  are  not 
hard  to  find.  For  example,  an  early  paper  (Cooper  et  al.,  1952)  presented 
preliminary  evidence  for  the  separate  contributions  of  several  acoustic  cues 
to  the  perception  of  the  [m-£]  distinction,  among  others.  Later,  it  was  shown 
more clearly that, in the perception of place of production in stops, second-
and third-formant transitions made independent contributions (Harris, Hoffman,
Liberman, Delattre, and Cooper, 1958; Hoffman, 1958; see also Dorman,
Studdert-Kennedy, and Raphael, in press). In the current literature is a particularly
elegant  study  of  the  voicing  distinction  by  Summerfield  and  Haggard  (1977) 
that  reports  an  equivalence  between  the  starting  point  of  the  first  formant 
and the variable known as 'voice-onset-time' and shows explicitly how these
acoustically disparate cues are related in articulation. A somewhat similar
result with two voicing cues (fundamental frequency and voice-onset-time)
has been found recently by Massaro and Cohen (1976) [cf. Haggard, Ambler, and
Callow (1970)], though an articulatory basis was not made explicit.


Having offered several examples of the equivalences in phonetic perception
between different acoustic cues that are the consequences of the same
articulation, we should bring this section to a close. But not without first
saying that it is hard to know where the list of relevant examples should end.
Should we, for example, include the kind of equivalence that is found between
spectral cues for syllable-initial consonants and the duration of the
syllable,9 or between silence as a cue (for voicing, or place, or gemination)
and the tempo of the surrounding speech (Pickett and Decker, 1960; Port, 1976),
or between the setting of the second-formant transition as a cue for the stops
and the position of the first formant (Rand, 1971)?10 It is when we try to
answer  that  question,  and  thus  to  define  the  boundaries  of  the  phenomenon  we 
are  here  considering,  that  we  see  most  clearly  how  unsatisfactory  from  a 
theoretical  point  of  view  is  the  notion  of  acoustic  cue.  We  find  it  useful, 
even  necessary,  when  we  want  to  refer  to  those  pieces  of  sound  that  an 
experimenter  varied  and  found  to  be  effective.  But  if  the  cues  are  to  be 
fitted  into  a conceptual  frame — as  something  other  than  items  in  a list — we 
should regard them as information about the behavior of a speaker's vocal
tract.

So  far  we  have  considered  only  those  different  acoustic  cues  that  are 
phonetically equivalent because they are the common products of a single
articulatory gesture. These are, perhaps, the least complex and most telling
of  the  instances  that  imply  a link  between  speech  perception  and  speech 
production.  But  they  are  not  the  only  ones.  Equally  numerous  are  the  cases 
in  which  there  is  phonetic  equivalence  between  acoustic  cues  that  are  very 
different  because  the  phone  they  signal  is  produced  in  different  contexts 
(Liberman,  et  al.,  1967;  but  see  Stevens,  1975).  In  these  cases,  too,  we 
suppose  that  a common  articulation  is  responsible  for  that  which  is  common  in 
the  perception.  Of  course,  such  articulations  as  these  can  hardly  be 
identical  in  all  particulars,  since  they  are  linked  to  the  gestures  for  the 
surrounding  phones,  and  these  change,  of  course,  with  each  new  context;  the 
commonality  can  only  be  seen  in  terms  of  shared  components,  whether  end 
targets  or  inferred  motor  commands.  (For  relevant  discussion,  see  MacNeilage, 
1970).  But  given  such  articulatory  similarity  as  there  may  be,  gross 
differences  in  acoustic  signal  can  and  often  do  arise  with  changes  in  context, 
primarily  as  a consequence  of  coarticulation.  It  is  the  more  important,  then, 
to  give  some  attention  to  these  context-conditioned  variations  in  the  cues 
because,  as  we  said  in  an  earlier  section,  coarticulation  is  the  essence  of 
the  speech  code. 

To  illustrate  how  acoustic  cues  that  vary  because  of  phonetic  context  are 
nevertheless  equivalent  in  phonetic  perception,  we  choose  an  example  that 
shows  two  kinds  of  contextual  effects,  one  that  depends  on  variations  in  the 
identity  of  the  phone  following  the  target  phone,  and  another  having  to  do 
with  variations  in  the  position  of  the  target  phone  in  the  syllable.  The 
example  is  the  pair  of  syllables  [did]  and  [dud],  shown  schematically  as  two- 
formant  approximations  in  Figure  5,  and  taken  from  the  results  of  early 
experiments  on  the  stops  (Delattre,  et  al.,  1955).  (These  patterns  are 
appropriate, and reasonably sufficient, for synthesizing the intended
syllables.) Having noticed that the lower (first) formant is the same in the two
cases,  we  fix  attention  on  the  higher  (second)  one.  We  see  there  that,  as  a 
consequence  of  coarticulation,  a phonetic  alteration  limited  to  the  middle 
(vowel)  segment  of  a consonant-vowel-consonant  syllable  does  not  change  only 
the  middle  portion  of  the  sound;  rather,  it  changes  the  entire  second  formant. 
The  transition  cues  for  [d]  are  therefore  in  very  different  positions  in  the 
spectrum,  being  relatively  high  in  frequency  for  [did]  and  low  for  [dud]. 
Moreover,  the  transition  cues  for  stops  in  corresponding  positions  in  the 
syllable  are  opposite  in  direction — for  [did]  they  are  rising  in  initial 
position  and  falling  in  final  position,  but  for  [dud]  they  are  falling  in 
initial  position  and  rising  in  final  position.  Of  course,  the  inference  we 
would  draw  from  these  cases  is  much  the  same  as  that  we  draw  from  those  in 
which  the  context  was  fixed  and  the  disparate  acoustic  cues  were  the  products 
of exactly the same gesture: the cues are presumably interpreted by a
phonetic device that acts as if it knew how they were produced. But if the
device  has  that  ability,  then  it  can  conceivably  do  more  than  just  'hear 
through'  the  context-conditioned  variation  in  the  cues  so  as  to  arrive  at  the 
canonical  form  of  the  phonetic  segment;  it  might  also  be  able  to  take 
advantage  of  the  fact  that  such  variation  produces  a special  kind  of 
redundancy  in  the  signal  and  provides  important  information  about  such  aspects 
of  the  phonetic  structure  as  sequential  order,  juncture,  linguistic  stress, 
and tempo. If so, then the acoustic variation that is produced by
articulation (and coarticulation) in different contexts would not be an
obstacle to perception but a considerable help and, correspondingly, a most
important characteristic of the speech code.

Non-equivalence in phonetic perception of an acoustic cue produced by
different vocal tracts: ecological constraints of a phonetic sort. Given that
phonetic perception is somehow shaped by what a vocal tract does, as we have
suggested it might be, we should ask: whose vocal tract?
Common  sense  suggests  that  it  can  hardly  be  that  of  the  listener,  nor  yet  that 
of  the  speaker;  most  plausibly,  it  must  be  some  abstract  conception  of  vocal 
tracts  in  general.  We  should  expect,  then,  that  the  phonetic  device  would 
behave  as  if  it  knew,  for  example,  that  two  vocal  tracts  can  do  what  one  vocal 
tract  can  not.  In  that  case,  acoustic  cues  might  have  one  effect  or  another, 
depending  on  whether  they  were  produced  by  one  speaker  or  by  two.  That  such 
ecological  considerations  are  important  is  indicated  by  experiments. 

One  dealt  with  the  perception  of  the  syllable-final  stop  in  the  example 
of [eb de] vs. [eg de] that we described earlier. There, it will be
remembered, listeners could hear the [b] or [g] only if there was a sufficient
interval  of  silence  between  the  syllables,  presumably  because  the  phonetic 
perceiving  device  'knew'  that  the  speaker  could  not  have  produced  both  stops 
without  closing  his  vocal  tract  for  a certain  period  of  time.  But  two  vocal 
tracts — one  saying  [eb]  (or  [eg]),  the  other  [de] — can  produce  the  disyllable 
[eb de] (or [eg de]) with no silence at all between the two syllables. The
experiment revealed that listeners behaved accordingly: when a single speaker
produced both first and second syllables, a silent interval of some duration
was necessary to perception of the syllable-final stops, but when one speaker
produced the first syllable and another the second, listeners heard the
syllable-final stops even when there was no intersyllabic silence at all
(Dorman, et al., 1975).

The  other  experiment  dealt  with  the  distinction  between  fricative  and 
affricate  (in  'shop'  vs.  'chop')  that  we  also  described  earlier.  In  that 
case,  inserting  a sufficient  amount  of  silence  between  'say'  and  'shop'  caused 
the  listener  to  hear  'chop'.  Our  assumption  was  that  this  occurred  because 
the  silence  informed  the  listener  that  the  speaker  had  closed  his  vocal  tract, 
as  he  must  to  produce  the  affricate.  But  two  vocal  tracts — one  saying  'now 
say' and the other 'chop' — can produce 'now say chop' with no silence at all
between  'say'  and  'chop.'  Thus,  with  two  speakers,  the  size  of  the  interval  of 
silence provides no useful phonetic information. The results of the experiment
suggested that the listeners' perceptions took account of that fact.
Starting  with  'now  say'  and  'shop',  and  given  a silent  interval  appropriate 
for  'chop',  listeners  did  indeed  hear  'now  say  chop'  if  there  was  only  one 
speaker;  but  if  there  were  two,  then  listeners  heard  'now  say  shop'  at  all 
intervals  of  silence  (Raphael,  Dorman,  and  Liberman,  1975). 

Those  results  imply  that  the  vocal  tract  to  which  the  perception  is 
linked  is  a very  abstract  one  indeed,  as  we  should  have  expected.  But  they 
also  provide  additional  support,  and  of  a rather  different  kind,  for  the 
hypothesis  that  some  such  link  does  indeed  exist. 

Addition of equivalent acoustic cues: algebraic sums in the phonetic
domain. The claim that two very different acoustic cues are equivalent
in phonetic perception is largely based on the experimental demonstration of a
trading relation between them. Thus, it has been determined that some number
of milliseconds of a temporal cue is equal to some particular setting of a
spectral  cue.  An  implication  is  that  the  two  cues  together  will  summate 
algebraically  to  enhance  or  reduce  the  perceived  phonetic  contrast,  depending 
on  just  how  they  are  combined.  We  believe  that  to  be  worth  remarking  because 
cues that are algebraically summed would have positive and negative signs only
in  the  phonetic  domain,  or  so  it  would  seem.  An  example  may  show  why. 

Recall  the  fact,  described  earlier,  that  an  appropriate  period  of  silence 
inserted  between  an  s-noise  and  the  syllable  [lit]  will  produce  [slit]  if  the 
interval  is  relatively  short  but  [split]  if  it  is  sufficiently  long.  Given 
that  we  can,  of  course,  also  convert  [lit]  to  [plit]  by  appropriately  changing 
the  spectrum  at  the  beginning  of  the  vocalic  syllable — specifically,  by 
altering  the  formant  transitions — it  follows  that  we  can  use  the  spectral 
maneuver  to  interconvert  between  [slit]  and  [split]  while  holding  the  temporal 
cue  fixed  (Erickson,  et  al.,  1977).  Those  facts  are  diagrammed  in  Figure  6 as 
Pairs  I and  II,  where  we  characterize  the  cues  as  'minus  p'  or  'plus  p'  to 
indicate  the  way  they  bias  the  perception.  In  this  case,  as  in  the  others  we 
described  earlier,  we  see  how  a phonetic  distinction  can  be  produced  by  either 
of  two  cues,  one  spectral,  the  other  temporal.  In  Pairs  III  and  IV,  both  the 
spectral  and  temporal  cues  differ  between  the  members  of  the  pair,  but  in 
different  ways.  In  the  one  case  (Pair  III),  the  combination  enhances  the 
perceived  difference,  while  in  the  other  (Pair  IV)  it  permits  the  minus  and 
plus  biases  to  summate  (algebraically)  so  as  to  produce  two  percepts  [split] 
which  are  the  same  or  very  little  different  (Liberman  and  Pisoni,  in  press). 

To  appreciate  the  significance  of  the  perceptual  addition  exemplified  in 
Figure  6,  we  should  think  of  it  as  a paradigm  for  comparative  studies  with 
nonhuman  animals.  Those  would  be  enlightening  because  phonetic  perception, 
and the algebraic summation that goes with it, exist presumably only in
creatures that speak. Others would perceive the stimuli of Figure 6 in an
auditory way. Hence, they should find the pairs that differ by two cues (III
and IV) to be more discriminable than those (I and II) that differ only by
one, and, further, the pairs with two-cue differences should be almost equally
discriminable. Note, incidentally, how relatively easy it would be to test
that expectation, not only with animals but with human infants: the measure —
relative difficulty of discrimination — is surely one of the easiest to make,
and the order of difficulty to be expected from nonhuman animals is very
different from that already obtained with us human beings.
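
The contrast between the two predicted orderings can be made concrete with a small sketch. It is only an illustration under stated assumptions: the signed unit biases (+1 for 'plus p', -1 for 'minus p'), the fixed cue settings chosen for Pairs I and II, and the function names are all hypothetical and are not taken from Figure 6 or from the experiments cited.

```python
# Hypothetical signed biases: +1 pushes perception toward [p], -1 away from it.
PLUS_P, MINUS_P = +1, -1

# Each stimulus is (temporal cue, spectral cue). The settings held fixed in
# Pairs I and II are assumptions made for this illustration.
PAIRS = {
    "I":   ((MINUS_P, MINUS_P), (PLUS_P, MINUS_P)),   # only the temporal cue differs
    "II":  ((MINUS_P, MINUS_P), (MINUS_P, PLUS_P)),   # only the spectral cue differs
    "III": ((MINUS_P, MINUS_P), (PLUS_P, PLUS_P)),    # both differ, biases cooperate
    "IV":  ((MINUS_P, PLUS_P), (PLUS_P, MINUS_P)),    # both differ, biases cancel
}

def phonetic_difference(a, b):
    """Difference between the algebraic sums of the cue biases (phonetic mode)."""
    return abs(sum(a) - sum(b))

def auditory_difference(a, b):
    """Number of cues that differ, ignoring sign (auditory mode)."""
    return sum(x != y for x, y in zip(a, b))

for name, (a, b) in PAIRS.items():
    print(name, "phonetic:", phonetic_difference(a, b),
          "auditory:", auditory_difference(a, b))
```

On these assumptions the phonetic measure separates Pair III (largest difference) from Pair IV (no difference), while the auditory measure treats III and IV alike and ranks both above I and II, which is the pattern of expectations described in the text.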

Nonequivalence in phonetic perception of the same or similar acoustic
cues. Just as the processes of speech production cause different
acoustic cues to be correlated in articulation and (hence) equivalent in
perception, so also, if in a somewhat more complex way, do they sometimes
cause the same cue to be uncorrelated in articulation and (hence) different in
perception. An early instance of this was seen in the first 'synthetic'
experiment on the stops, where it was found that a burst centered at 1440 Hz
was perceived differently in front of different vowels (Liberman, et al.,
1952). Subsequently, much the same effect was found with real speech (Schatz,
1954). More recently, the general effect has been confirmed, though with
better methods for controlling the stimuli, but now it is seen that the exact
nature of the effect varies somewhat depending on just how much of the 'real'
burst is used and just where it is placed in time with reference to the
vowel.

Figure 6: Diagrams that illustrate how spectral and temporal cues separately
produce the same phonetic distinction (Pairs I and II) and how,
taken together, they either enhance that distinction or reduce it
(Pairs III and IV).

Another example concerns silence, about which we have already heard so
much. Having seen earlier that it is a cue for the perception of phonetic
segments, we should note now that it is effective in regard to all three
phonetic dimensions: manner, voicing, and place. In connection with manner,
we should remember that an appropriate amount of silence, placed between the
noise of a fricative and a vocalic piece of sound, will produce the perception
of a stop consonant, the perceived 'place' of the stop depending on the nature
of the vocalic section. We should also remember that, in similar fashion,
silence will produce the affricate manner when introduced appropriately
between, for example, the word 'say' and the word 'shop'. In regard to
voicing, we saw earlier that variations in the duration of intersyllable
silence will convert a poststress voiced stop (as in 'rabid') into voiceless
(as in 'rapid'), and vice versa. Now we turn to the dimension of place, and
point out, as we had not before, that in a disyllable like 'rabid', reductions
in the duration of intersyllable silence will cause the listener to hear
'ratid' — that is, a stop with a different place of production (Port, 1976).
This perceptual change correlates — not accidentally, according to our
hypothesis — with the fact that a speaker closes his vocal tract for a shorter time
when he says 'ratid' than when he says 'rabid'. Given the utterance 'rabid,'
and an artificially shortened silence between the syllables, it is as if the
listener heard 'ratid' because his phonetic perceiver knows that the speaker
could not have said 'rabid' since he did not close his vocal tract long
enough. In sum, then, a single acoustic dimension, duration of silence,
produces  contrasts  on  each  of  three  phonetic  and  perceptual  dimensions — 
manner,  voicing,  and  place.  That  curious  situation  arises  because  the  very 
different  kinds  of  articulations — indeed,  the  different  sets  of  muscles — that 
underlie  the  independence  of  those  dimensions  in  the  phonetic  and  perceptual 
domains  happen  to  converge  on  a single  acoustic  dimension. 

Perhaps  the  reader  will  have  noticed  that  we  did  not  specify  the  amounts 
of  silence  that  are  appropriate  in  the  aforementioned  cases,  and  he  will  quite 
naturally  wonder  if  they  are  within  the  same  ranges  for  the  three  phonetic  and 
perceptual  dimensions.  We  did  not  specify  because  the  appropriate  durations 
vary  according  to  how  the  other  relevant  cues  are  set,  and  much  of  this 
remains  to  be  worked  out.  It  is  reasonably  clear,  even  now,  that  the 
durations  of  silence  for  manner  and  voicing  overlap  greatly.  For  place,  there 
probably  is  some  overlap  with  voicing,  depending  on  just  how  the  other  cues 
are  set,  but  at  this  moment  the  relevant  data  have  not  been  gathered. 

Since  we  have,  up  to  this  point,  looked  only  at  the  segmental  aspects  of 
phonetic  structure,  it  may  seem  inappropriate  that  we  should  now  broaden  our 
view  to  glimpse  those  other  aspects  that  pertain  to  prosody  and  syntax.  But 
the  temptation  to  do  so  is  great  because  there  is,  at  just  this  juncture,  a 
very  natural  and  interesting  connection.  The  point  is  that  the  duration  of  a 
syllable conveys information not only about the identity of the phonetic
segments  it  comprises,  as  we  have  already  seen,  but  also  about  the  tempo  (rate 
of  articulation),  degree  of  linguistic  stress,  and  position  in  the  syntactic 
frame.  We  do  not  know  how  the  separate  contributions  to  duration  are  sorted 
out  in  perception,  but,  as  Klatt  (1976)  has  pointed  out,  considerations  of 
simple  logic  suggest  that  the  perceiver  can  hardly  arrive  at  his  decisions  in 
some  particular  order,  one  at  a time,  since  each  decision  would  appear  to 
depend  on  every  other  one.  In  any  case,  it  does  appear  that,  in  production, 
these  several  aspects  of  the  message  are  encoded  into  the  same  aspect  of  the 
signal,  and  then,  in  perception,  properly  recovered. 

IS  PHONETIC  PERCEPTION  NECESSARY? 

So  far  we  have  assumed  the  perceptual  reality  of  phone-size  segments.  In 
this  final  section  we  propose  to  justify  that  assumption. 

Preliminary

No  one,  of  course,  doubts  that  linguistic  utterances  are  perceived  as 
sequences  of  word-like,  or  morphemic,  segments.  But  the  processes  by  which 
these  segments  are  extracted  from  the  acoustic  signal  are  far  from  certain. 
Do  we,  in  perceiving  speech,  pass  directly  from  the  overall  acoustic  shape  of 
the  constituent  morphemes  to  their  syntactic  and  semantic  attributes,  or  do 
we,  rather,  first  analyze  at  least  some  portion  of  each  utterance  (with  the 
possible exception of nonpropositional greetings, interjections, and
expletives) into its phonological components, and only then proceed to syntax and
meaning? Certainly, the phonological attributes of each morpheme are
available to consciousness. But what is the form that gives access to the
listener's  lexicon?  Is  lexical  storage  analog  and  isomorphic  with  gross 
auditory  shape,  or  is  it  digital  and  isomorphic  with  phonological  structure? 

We  should  make  clear  from  the  outset  that  the  precise  form  of  any 
possible  morphemic  sound  pattern  is  not  our  concern.  We  have  spoken  until  now 
of  phones,  since  we  take  a phonetic  representation  to  be  at  the  first  remove 
from  the  auditory  signal  and  to  be  perceptually  available,  even  though  not 
attended  to  in  normal  listening.  However,  for  the  present  discussion,  it  is  a 
matter  of  indifference  whether  the  representation  is  assumed  to  be  a feature 
matrix,  a sequence  of  phones,  or  a sequence  of  more  abstract  phonologic 
segments.  Our  only  concern  is  whether  the  form  is  segmented  or  unsegmented. 

Evidence  Against  Segments  Smaller  than  Syllables 

Consider,  first,  the  grounds  for  believing  the  perceptual  representation 
to  be  unsegmented.  Foremost  is  the  fact,  to  which  we  have  repeatedly  alluded, 
that  phonetic  segments  are  not  discretely  arrayed  in  time,  as  are  letters  of 
the  alphabet  in  space,  but  are,  rather,  transmitted  simultaneously  or  with 
considerable  shingling.  This  fact  alone  has  led  some  students  to  abandon  the 
phone  as  a perceptual  unit  in  favor  of  the  context-sensitive  allophone 
(Wickelgren,  1969;  but  see  Halwes  and  Jenkins,  1971)  or  the  syllable  (Massaro, 
1972;  Warren,  1976a;  1976b). 


A second line of argument draws on reaction-time studies demonstrating
that listeners, asked to monitor a word-list or sentence, display successively
shorter reaction times as the target item increases in duration from phone to
syllable to word (Savin and Bever, 1970) and even to sentence (Bever, 1970),
suggesting a perceptual progression from larger unit to smaller rather than
the reverse. The solution to this paradox was provided by McNeill and Lindig
(1973), who showed that reaction times are, in fact, shortest for the items of
which a list is composed or, in other words, for those items to which the
experimental situation has drawn the listener's attention. (See also Foss and
Swinney, 1973.) Rubin, Turvey and van Gelder (1976) have elaborated these
conclusions, arguing that such monitoring experiments do not measure the time
taken to process the targets perceptually, but rather the time taken to bring
them into consciousness (cf. Studdert-Kennedy, 1974, p. 2366). This does not,
of course, preclude the possibility that normal processing entails unconscious
access to the lexicon through phonological analysis, since, in all likelihood,
these experiments have no bearing on normal perceptual processes at all.
However, it does invite the reflection that the several attributes of a
morpheme — its phonological components, syllabic structure, syntactic and
semantic markers — may all be simultaneously available to the listener, once
access  to  his  lexicon  has  been  granted  by  overall  acoustic  (or,  in  reading, 
visual)  shape  (Warren,  1976b). 

Finally,  a broad  line  of  argument  springs  from  the  suspicion  that  the 
study  of  speech  perception  has  been  tied  to  the  isolated  syllable  and  its 
components  at  the  expense  of  attention  to  the  overall  acoustic  pattern  of 
running  speech.  This  overall  pattern,  or  prosody,  certainly  conveys  important 
information.  Svensson  (1974),  for  example,  has  shown  that  the  perceived  form 
of  hummed  speech  (that  is,  speech  lacking  all  the  acoustic  cues  for  its 
phonetic  segments)  is  often  syntactically  correct.  Martin  (1972,  1975)  has 
argued  that  speech  rhythm  may  enable  listeners  to  predict  upcoming  stresses. 
And  Darwin  (1975)  has  even  induced  listeners  to  reveal  a preference  in  some 
circumstances for good prosody over good syntax and meaning. These and other
studies  (for  example,  Cohen  and  Nooteboom,  1975,  passim)  do  suggest  that  the 
role  of  prosody  in  speech  perception  may  have  been  underestimated.  In  fact, 
if  we  combine  these  studies  with  recent  work  on  possible  invariant  acoustic 
correlates  of  distinctive  features  in  the  speech  stream  (Stevens,  1975),  we 
may  be  tempted  to  propose  once  again  the  "novel  theory  of  speech  perception," 
first put forward by Chomsky and Miller (1963, p. 311) and elaborated by
Chomsky  and  Halle  (1968,  p.  24),  by  which  a few  more  or  less  invariant 
acoustic  properties  give  the  listener  access  to  his  lexicon  and  so  precipitate 
a plausible  syntactic  and  semantic  analysis  of  an  utterance. 

In  short,  a fair  body  of  evidence  suggests  that  the  acoustic  structure  of 
spoken  utterances  may  be  sufficient  to  access  the  listener's  lexicon,  or  at 
least his syllabary, without an intermediate stage of phonological analysis.
However,  we  do  not  believe  that  this  view  is  correct  and  in  the  following 
sections  we  will  try  to  explain  why. 


Evidence in Favor of Segments Smaller than the Syllable

Experimental evidence. There is a great weight of evidence for the
psychological reality of every level of phonetic analysis, from feature to
phone to syllable. We have reviewed much of this evidence elsewhere
(Studdert-Kennedy, 1976). Here we do no more than remark that studies of speaking
errors (Boomer and Laver, 1968; Fromkin, 1971), perceptual confusions (Miller
and Nicely, 1955; Mitchell, 1973), synthetic speech continua (Liberman, et
al., 1967), dichotic listening (Shankweiler and Studdert-Kennedy, 1967;
Studdert-Kennedy and Shankweiler, 1970) and "verbal transformations" (Goldstein
and  Lackner,  1973;  Warren,  1976a)  leave  little  room  for  doubt  that  both  phones 
and  features  have  some  form  of  psychological  reality.  To  this  experimental 
evidence  we  may  add  the  testimony  of  linguistic  analysis  (for  example, 
Gleason,  1955),  including  studies  of  language  change  (for  example,  Lehman, 
1975),  not  to  mention  the  very  existence  of  alphabetic  writing. 

The  structure  of  the  syllable.  As  the  etymology  of  its  name  implies,  the 
syllable  is  a compound,  the  vehicle  of  a natural  acoustic  contrast  between 
consonant  constriction  and  vowel  opening,  a contrast  frequently  claimed  as  a 
phonological  universal  (for  example,  Postal,  1968).  The  contrast  is  clearly 
reflected  in  perception,  as  evidenced  by  a long  series  of  studies  over  the 
past  fifteen  years.  These  studies,  employing  a variety  of  experimental 
paradigms — identification  and  discrimination  of  synthetic  speech  sounds, 
short-term  memory,  reaction  time,  dichotic  listening,  backward  masking  and 
others — converge  on  the  conclusion  that  consonants  and  vowels  perform  distinct 
perceptual  functions.  Once  again,  we  have  reviewed  this  matter  elsewhere 
(Studdert-Kennedy,  1975b,  1976)  and  will  not  do  so  here.  We  simply  remark 
that  none  of  the  varied  evidence  for  the  perceptual  contrast  between 
consonants  and  vowels  could  exist  if  the  syllable  were  not  analyzed  in 
perception.

We  may  note,  in  passing,  one  further  point.  In  all  languages,  the 
syllable  is  the  unit  of  poetic  meter.  Except  where  syllable  and  morpheme 
normally  coincide  (for  example,  Japanese  haiku),  metrical  rules  are  specified 
in  terms  of  syllables  and  their  expected  'length'  or  degree  of  stress.  Of 
particular  interest  in  the  present  context  is  the  fact  that  length  is 
frequently  specified  not  by  lexical  form  but  by  neighboring  phonetic  segments. 
In  both  Latin  and  ancient  Greek  verse,  for  example,  the  length  assigned  to  a 
word-final  CVC  syllable  varies  as  a function  of  the  initial  phone  of  the 
following word. Thus a famous Horace Ode (Book III, Ode XXVI) begins: "Vixi
puellis nuper idoneus..." Here the third word scans as a trochee because the
following  word  begins  with  a vowel,  but  would  have  scanned  as  a spondee  had 
the  following  word  begun  with  a consonant.  This  simple  rule  of  ancient  verse 
obviously  required  that  the  singer  (and  presumably  the  listener  who  could 
detect  a singer's  error)  be  aware  of  the  phonological  structure  of  the 
syllable. 
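
The rule just described can be written out as a toy sketch. It is an illustration only, under assumptions that are ours and not part of the metrical tradition as such: vowels are taken to be the letters a, e, i, o, u (consonantal uses of 'i' and 'u', and the treatment of 'h', are ignored), the first syllable of 'nuper' is assumed to be long, and the function names are hypothetical.

```python
VOWELS = set("aeiou")  # toy assumption: ignores consonantal i/u and the letter h

def final_cvc_length(next_word):
    """Length assigned to a word-final CVC syllable, given the following word."""
    # Before a vowel the final consonant does not close the syllable: short.
    # Before a consonant the syllable is closed: long.
    return "short" if next_word[0].lower() in VOWELS else "long"

def scan_nuper(next_word):
    """Foot formed by 'nuper', assuming its first syllable is long."""
    return "trochee" if final_cvc_length(next_word) == "short" else "spondee"

print(scan_nuper("idoneus"))   # vowel-initial following word, as in the Ode
print(scan_nuper("puellis"))   # hypothetical consonant-initial following word
```

The point of the sketch is only that the scansion cannot be computed from the word 'nuper' alone; it requires access to the first segment of the next word.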

The perceptual function of phonological categories. A crucial process in
the perception of fluent speech must be short-term storage of early portions
of an utterance pending final interpretation. What is the form of this store?
Clearly it cannot be simply auditory, since a precategorical auditory store
(Crowder and Morton, 1969; Crowder, 1972) is sensitive to overwriting from
immediately  following  items  (Crowder,  1971).  Nor,  given  our  sensitivity  to 
phonetic  structure  (most  obviously  in  listening  to  poetry)  can  the  store  be 
purely  functional  or  semantic,  with  all  phonetic  detail  stripped  away. 

In  fact,  we  wish  to  argue  that,  to  fulfill  this  linguistic  function,  a 
general  perceptual  process  is  invoked,  namely,  division  into  'stages.'  Among 
the  likely  functions  of  'perceptual  stages' — whether  defined  in  time  or  in 
neural  locus — is  to  isolate  one  process  from  another,  and  to  store  energy  or 
information  for  later  use.  We  may  see  this  most  clearly  at  the  periphery. 
Every sensory system integrates energy: if the system were infinitely damped,
threshold for activation would never be reached. Accumulation of energy over
some finite period permits the mechanical response of the ear, for example, to
develop. On the other hand, the period of integration must be finite to
prevent physical destruction of the system: mechanical energy becomes
bioelectricity. Analogous cycles of integration and transformation presumably
recur, as energy or information progresses through the system. Activity in
afferent fibers gives rise to more central neural activity and, ultimately
(jumping levels of discourse), to a preperceptual 'image' (Massaro, 1972).
The  'image',  in  turn,  must  have  some  finite  duration,  long  enough  to  institute 
further  processing,  short  enough  to  prevent  'babble.' 

Returning  with  this  metaphor  to  language,  we  note  that  speech  is  arrayed 
in  time,  and  that  both  syntax  and  meaning  demand  some  minimum  quantity  of 
information  before  linguistic  structure  can  emerge.  The  perceptual  function 
of  phonological  categories  may  then  be,  on  the  one  hand,  to  forestall  auditory 
babble,  on  the  other,  to  store  information  derived  from  the  signal  until  such 
time  as  it  can  be  granted  a linguistic  interpretation.  In  other  words,  the 
perceptual  function  of  phonological  categories  is  that  of  a buffer  between 
acoustic  signal  and  meaningful  message. 

Recovery  of  the  morpheme.  We  come,  finally,  to  the  phonological  function 
without which, we believe, linguistic communication would not be possible,
namely,  to  provide  a code  for  lexical  storage. 

Notice  first  that  if  lexical  items  are  coded  according  to  overall 
acoustic  structure,  the  form  must  be  sufficiently  stylized,  stripped  of 
acoustic  detail,  for  the  word  to  be  accessed,  despite  a wide  variety  of 
surface  forms.  For  example,  the  duration  of  a single  monosyllabic  word, 
spoken  by  a single  speaker  at  a conversational  rate  in  a random  list  or  in  a 
sentence,  may  vary  by  a factor  of  2 to  1 (Gaitenby,  1965;  Kozhevnikov  and 
Chistovich,  1965;  Lackner  and  Levine,  1975),  and  yet  be  fully  intelligible  in 
both  contexts.  Furthermore,  the  durational  variants  are  not  related  by  a 
simple scale-factor: most of the variation occurs over the syllable nucleus
rather than over its edges (Gaitenby, 1965; Lehiste, 1970; Huggins, 1972), so
that  an  algorithm  for  generalizing  two  extreme  acoustic  variants  could  hardly 
succeed  without  at  least  some  analysis  of  the  overall  acoustic  shape. 

If we add to durational variations other within-speaker variations in
fundamental frequency (which, coupled with duration, is the primary acoustic
correlate of variations in linguistic stress) and in formant structure (due to
cross-morphemic effects of coarticulation in running speech), not to mention
acoustically similar across-speaker variations due to age, sex and dialect, we
are confronted with a formidable array of acoustic forms each of which — if
unanalyzed acoustic structure is to give access to the lexicon — will have to
be  reduced  to  canonical  acoustic  form. 

Now,  it  is  true  that  the  invariance  problem  is  scarcely  less  serious  if 
the  message  units  to  be  recovered  from  the  signal  are  phonological  entities 
such  as  features  or  phones  than  if  they  are  morphemes  or  words,  and  even  a 
cursory  survey  of  the  literature  of  speech  perception  will  show  that,  as  in 
the  earlier  sections  of  our  chapter,  this  is  a recurrent  preoccupation. 
However,  we  should  note  that  the  'audile'  listener,  consigned  to  lexical 
search  with  nothing  but  overall  acoustic  shape  (and  a few  syntactic-semantic 
hints  derived  from  prosody  and  context)  to  guide  him,  is  deprived  of  at  least 
one valuable aid, namely the systematic phonological and phonotactic
constraints of his language. He will not be permitted to resolve uncertainty
by  drawing  on  his  knowledge  that  a particular  portion  of  the  acoustic  pattern, 
or  a particular  sequence  of  acoustic  segments,  cannot  occur  in  his  language. 
Rather,  every  morphemic  sound  pattern  will  be  distinct,  and  access  to  its 
semantic  and  syntactic  attributes  will  be  direct.  In  other  words,  the  vast 
and  subtle  array  of  systematic  phonology  that  linguistic  studies  have  brought 
into view over the past one hundred and fifty years will be no more than
epiphenomenal froth, communicatively vacuous, at least for the listener, if
not for the speaker.


Nonetheless,  let  us  set  the  problem  of  invariance  aside.  Let  us  assume, 
for  the  moment,  that  it  has  been  solved  and  that  we  are  able  to  specify  for 
every  word  or  morpheme  a unique  canonical  acoustic  form  apt  for  every  context 
and  every  speaker.  We  shall  then  be  confronted  with  the  deeper  problem  of  how 
the  listener  segments  an  utterance  into  its  constituent  morphemes  or  words. 

The  heart  of  the  problem  is  simply  that  speakers  freely  coarticulate 
across  word  and  morpheme  boundaries.  A consequence  is  that  dividing  the 
speech  stream  by  use  of  an  acoustic  (or  auditory)  criterion  will  yield 
segments  that  bear  a random  relation  (in  size)  to  the  words  or  morphemes.  In 
that  circumstance,  the  audile  listener  would  have  to  store,  not  merely  the 
20,000  or  30,000  canonical  auditory  patterns  that  would  represent  the  words  in 
his  vocabulary,  but  rather  a number  unimaginably  greater  than  that  (Liberman 
and  Pisoni,  in  press).  Even  if  he  had  a reliable  acoustic  criterion  for 
dividing  an  utterance  into  syllables  (see  Mermelstein,  1975),  he  would  not  be 
able  to  assign  the  syllables  to  their  appropriate  morphemes  without  analyzing 
them  into  their  phonetic  segments.  For  example,  syllabification  of  the  simple 
phrase,  "He’s  a repeated  offender,”  will  yield  eight  CV  syllables,  four  of 
which  cross  morpheme  boundaries  and  two  of  which  cross  word  boundaries.  In 
other words, syllable boundaries in fluent speech are frequently random with
respect  to  words  or  morphemes. 
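
The counts in the example can be checked with a short sketch. Everything in it is an assumption made for illustration: the broad ASCII transcription ('@' for the reduced vowel, 'R' for the final vocalic r), the division into morphemes, and the simple rule that a single consonant immediately before a vowel begins that vowel's syllable. None of this is offered as the procedure behind the figures in the text.

```python
# (phone, is_vowel, morpheme index, word index); transcription and
# morpheme division are illustrative assumptions.
PHONES = [
    ("h", False, 0, 0), ("i", True, 0, 0),                    # he
    ("z", False, 1, 0),                                       # 's
    ("@", True, 2, 1),                                        # a
    ("r", False, 3, 2), ("I", True, 3, 2), ("p", False, 3, 2),
    ("i", True, 3, 2), ("t", False, 3, 2),                    # repeat
    ("I", True, 4, 2), ("d", False, 4, 2),                    # -ed
    ("@", True, 5, 3), ("f", False, 5, 3), ("E", True, 5, 3),
    ("n", False, 5, 3), ("d", False, 5, 3),                   # offend
    ("R", True, 6, 3),                                        # -er
]

def syllabify(phones):
    """Start each syllable at the single consonant preceding a vowel, if any."""
    starts = []
    for i, (_, is_vowel, _, _) in enumerate(phones):
        if is_vowel:
            has_onset = i > 0 and not phones[i - 1][1]
            starts.append(i - 1 if has_onset else i)
    starts[0] = 0  # any utterance-initial consonants join the first syllable
    bounds = starts + [len(phones)]
    return [phones[a:b] for a, b in zip(bounds, bounds[1:])]

def crosses(syllable, field):
    """True if the syllable contains phones from more than one unit."""
    return len({phone[field] for phone in syllable}) > 1

syllables = syllabify(PHONES)
across_morphemes = sum(crosses(s, 2) for s in syllables)
across_words = sum(crosses(s, 3) for s in syllables)
print(len(syllables), across_morphemes, across_words)
```

With this transcription the sketch yields eight syllables, four of which straddle a morpheme boundary and two of which straddle a word boundary, in agreement with the counts given for the phrase. One syllable comes out as CVC rather than CV under this rule, so "CV" should be read loosely here.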

The  problem  is  exacerbated  for  inflectional  languages  where  changes  in  a 
single  phoneme  (initial,  medial  or  final)  often  suffice  to  signal  changes  in 
tense,  mood,  person,  number  or  case.  Simple  suffix  changes,  such  as  English 
plurals, might pose no problem for the 'audile' listener, despite the lawful
[s], [z] and [əz] alternations, and the absence of an acoustically marked
morpheme  boundary,  for  we  need  only  suppose  that  the  perceptual  'morpheme 
detector'  is  automatically  sprung  as  soon  as  a recognizable  acoustic  unit 
enters  the  system.  We  might  even  suppose  that  tense  contrasts  signalled  by  a 
change  in  medial  vowel  (as  in  'win'-'won')  are  learned  as  special  cases.  But 
we  will  find  it  a good  deal  more  difficult  to  explain,  for  example,  the 
formation  of  the  Greek  perfect  tense  by  duplication  of  the  initial  consonant 
of  the  present,  a fact  presumably  not  lost  on  the  listener.  Indeed,  as  we 
multiply  examples  (and  ad  hoc  solutions  for  the  imaginary  'audile'  listener), 
we  cannot  but  wonder  why  the  various  forms  of  a lexical  item  bear  any  relation 
to  one  another  at  all.  Are  we  to  suppose  that  these  variations  are  lawful  for 
the  speaker,  but  merely  adventitious  for  the  listener? 

Surely  not.  For,  quite  apart  from  the  general  lack  of  parsimony  in 
positing  totally  independent  input  and  output  lexicons,  we  would  be  reduced  to 
the  absurdity  of  supposing  that  a listener  consults  a lexicon  of  auditory 
segments  that  bear  no  more  than  a random  relation  to  the  articulatory  segments 
he  deploys  as  a speaker.  We  are  forced  to  conclude  that  only  by  extracting 
the  phonetic  segments — or,  more  properly,  their  underlying  phonological  forms — 
can  the  listener  discover  most  of  what  is  said  to  him. 

REFERENCES 


Ainsworth,  W.  A.  (1976)  Automatic  speech  recognition.  In  Mechanisms  of 
Speech  Recognition.  (Oxford,  Eng.:  Pergamon  Press) , 104-119. 

Ades,  A.  E.  (in  press)  Source  assignment  and  feature  extraction  in  speech. 
J . Exp . Psychol . : Human  Percept . Perform . 

Ades,  A.  E.  (1976)  Adapting  the  property  detectors  for  speech  perception. 
In  New  Approaches  to  Language  Mechanisms , ed . by  E.  C.  T.  Walker  and 
R.  J.  Wales.  (Amsterdam:  North  Holland). 

Bailey, P. J. (1973) Perceptual adaptation for acoustical features in
speech. Speech Perception, Progress Report. (Department of Psychology,
The Queen's University of Belfast), 2.2, 29-34.

Bailey, P. J., M. F. Dorman and A. Q. Summerfield. (1977) Identification of
sine-wave analogues of CV syllables in speech and non-speech modes.
J. Acoust. Soc. Am. 61, S(A).

Bever, T. G. (1970) The influence of speech performance on linguistic
structure. In Advances in Psycholinguistics, ed. by G. B. Flores
D'Arcais and W. J. M. Levelt. (Amsterdam: North Holland).

Boomer, D. S. and J. D. M. Laver. (1968) Slips of the tongue. Brit.
J. Dis. Communic. 3, 1-12.

Chomsky, N. and M. Halle. (1968) The Sound Pattern of English. (New York:
Harper & Row).

Chomsky, N. and G. A. Miller. (1963) Introduction to the formal analysis of
natural languages. In Handbook of Mathematical Psychology, vol. I,
ed. by R. D. Luce, R. R. Bush and E. Galanter. (New York: Wiley),
269-321.

Cohen, A. and S. G. Nooteboom, eds. (1975) Structure and Process in Speech
Perception. (New York: Springer-Verlag).

Coker, C., N. Umeda and C. P. Browman. (1973) Automatic synthesis from
ordinary English text. IEEE Trans. Audio Electroacoust. 21, 293-298.




Cole, R. A. and B. Scott. (1974) Toward a theory of speech perception.
Psych. Rev. 81, 348-374.

Cooper, F. S. (1950) Spectrum analysis. J. Acoust. Soc. Amer. 22, 761-762.

Cooper, F. S. (1953) Some instrumental aids to research on speech. In
Proceedings of the Fourth Annual Round Table Meeting on Linguistics and
Language Teaching. (Washington: Georgetown Univ.), 46-53.

Cooper, F. S. (1962) Speech synthesizers. In Proceedings of the Fourth
International Congress of Phonetic Sciences, ed. by A. Sovijärvi and
P. Aalto. (The Hague: Mouton), pp. 3-13.

Cooper, F. S. (1972) How is language conveyed by speech? In Language by Ear
and by Eye, ed. by J. F. Kavanagh and I. G. Mattingly. (Cambridge,
Mass.: MIT Press), 25-45.

Cooper, F. S., P. C. Delattre, A. M. Liberman, J. M. Borst and L. J. Gerstman.
(1952) Some experiments on the perception of synthetic speech sounds.
J. Acoust. Soc. Am. 24, 597-606.

Cooper, F. S., A. M. Liberman, and J. M. Borst. (1951) The interconversion
of audible and visible patterns as a basis for research in the perception
of speech. Proceedings of the National Academy of Sciences 37, 318-325.

Cooper, W. E. (1975) Selective adaptation to speech. In Cognitive Theory,
vol. 1, ed. by F. Restle, R. M. Shiffrin, J. N. Castellan, H. Lindman and
D. B. Pisoni. (Hillsdale, N.J.: Erlbaum Associates), 23-54.

Crowder, R. G. (1971) Waiting for the stimulus suffix: Decay, delay, rhythm
and readout in immediate memory. Quart. J. Exp. Psychol. 23, 324-340.

Crowder, R. G. (1972) Visual and auditory memory. In Language by Ear and by
Eye, ed. by J. F. Kavanagh and I. G. Mattingly. (Cambridge: MIT Press).

Crowder, R. G. and J. Morton. (1969) Precategorical acoustic storage (PAS).
Percept. Psychophys. 365-373.

Darwin, C. J. (1971) Ear differences in the recall of fricatives and vowels.
Quart. J. Exp. Psychol. 23, 46-62.

Darwin, C. J. (1975) On the dynamic use of prosody in speech perception. In
Structure and Process in Speech Perception, ed. by A. Cohen and
S. G. Nooteboom. (New York: Springer-Verlag), 178-194.

Darwin, C. J. (1976) The Perception of Speech. In Handbook of Perception,
vol. 7, ed. by E. Carterette and M. Friedman. (New York: Academic
Press), pp. 175-226.

Day, R. S. (1970) Temporal order perception of reversible phoneme cluster.
J. Acoust. Soc. Am. 48, 95(A).

Delattre, P. C., A. M. Liberman and F. S. Cooper. (1955) Acoustic loci and
transitional cues for consonants. J. Acoust. Soc. Am. 27, 769-773.

Denes, P. B. (1965) Effect of duration on the perception of voicing.
J. Acoust. Soc. Am. 27, 761-764.

Dorman, M., J. E. Cutting and L. J. Raphael. (1975) Perception of temporal
order in vowel sequences with and without formant transitions.
J. Exp. Psychol.: Human Percept. Perform. 104, 121-129.

Dorman, M. F., L. J. Raphael and A. M. Liberman. (1976) Further observations
on the role of silence in the perception of stop consonants. Haskins
Laboratories Status Report on Speech Research SR-48, 199-207.

Dorman, M. F., L. J. Raphael, A. M. Liberman and B. Repp. (1975) Maskinglike
phenomena in speech perception. J. Acoust. Soc. Am. 57, Suppl. 1,
S48(A). [Full text in Haskins Laboratories Status Report on Speech
Research SR-42/43, 265-276.]




Dorman, M., M. Studdert-Kennedy and L. J. Raphael. (in press) Stop consonant
recognition: Release bursts and formant transitions as functionally
equivalent, context-dependent cues. Percept. Psychophys.

Eimas, P. D. and J. D. Corbit. (1973) Selective adaptation of linguistic
feature detectors. Cog. Psychol. 13, 247-252.

Erickson, D. M., H. L. Fitch, T. G. Halwes and A. M. Liberman. (1977) Trading
relation in perception between silence and spectrum. J. Acoust.
Soc. Am. 61, S46(A).

Fant, C. G. M. (1962) Descriptive analysis of the acoustic aspects of
speech. Logos 5, 3-17.

Fant, C. G. M. (1963) Analysis and synthesis of speech processes. In Manual
of Phonetics, ed. by B. Malmberg. (Amsterdam: North-Holland), pp. 173-277.

Foss, D. J. and D. A. Swinney. (1973) On the psychological reality of the
phoneme: perception, identification and consciousness. J. Verbal
Learn. Verbal Behav. 12, 246-257.

Fromkin, V. A. (1971) The non-anomalous nature of anomalous utterances.
Language 47, 27-52.

Fujimura, O. (in press) A look into the effects of context: some
articulatory and perceptual findings. In Proceedings of VIIIth International
Congress of Phonetic Sciences, (Leeds, 1975).

Fujimura, O. (1975) Syllable as a unit of speech recognition. IEEE
Trans. Acoust. Sp. Sig. Proc. ASSP-23, 82-87.

Gaitenby, J. H. (1965) The elastic word. Haskins Laboratories Status
Report on Speech Research SR-2, 3.1.

Ganong, W. F. (1975) An experiment on "phonetic adaptation." Progress Report
(Research Laboratory of Electronics, MIT) 116, 206-210.

Gleason, H. A., Jr. (1955) Workbook in Descriptive Linguistics. (New York:
Holt, Rinehart & Winston).

Goldstein, L. M. and J. R. Lackner. (1973) Alterations of the phonetic coding
of speech sounds during repetition. Cognition 2, 279-297.

Haggard, M. P., S. Ambler, and M. Callow. (1970) Pitch as a voicing cue.
J. Acoust. Soc. Am. 47, 613-617.

Halwes, T. and J. J. Jenkins. (1971) Problem of serial order in behavior is
not resolved by context-sensitive associative memory models. Psych.
Rev. 78, 122-129.

Harris, C. M. (1953) A study of the building blocks of speech.
J. Acoust. Soc. Am. 25, 962-969.

Harris, K. S. (1958) Cues for the discrimination of American English
fricatives in spoken syllables. Lang. Speech 1, 1-7.

Harris, K. S., H. S. Hoffman, A. M. Liberman, P. C. Delattre, and F. S.
Cooper. (1958) Effect of third-formant transitions on the perception of
the voiced stop consonants. J. Acoust. Soc. Am. 30, 122-126.

Hoffman, H. S. (1958) Study of some cues in the perception of the voiced
stop consonants. J. Acoust. Soc. Am. 30, 1035-1041.

Huggins, A. W. F. (1972) On the perception of temporal phenomena in speech.
J. Acoust. Soc. Am. 51, 1279-1290.

Hyde, S. R. (1972) Automatic speech recognition: A critical survey and
discussion of the literature. In Human Communication: A Unified View,
ed. by E. E. David, Jr. and P. B. Denes. (N.Y.: McGraw-Hill), pp. 399-438.




Ingemann, F. (1957) Speech synthesis by rule. J. Acoust. Soc. Am. 29,
1255(A).

Kelly, J. L. and L. J. Gerstman. (1961) An artificial talker driven from a
phonetic input. J. Acoust. Soc. Am. 33, 835(A).

Kelly, J. L. and C. Lochbaum. (1962) Speech synthesis. Proceedings of the
Speech Communications Seminar, paper F7. (Stockholm: Speech Transmission
Laboratory, Royal Institute of Technology).

Klatt, D. H. (1976) Linguistic uses of segmental duration in English:
Acoustic and perceptual evidence. J. Acoust. Soc. Am. 59, 1208-1221.

Klima, E. S. (1975) Sound and its Absence in the Linguistic Symbol. In The
Role of Speech in Language, ed. by J. F. Kavanagh and J. E. Cutting.
(Cambridge: MIT Press), pp. 247-270.

Kozhevnikov, V. A. and L. A. Chistovich. (1965) Rech': Artikuliatsia i
vospriatie, Moscow-Leningrad. Trans. as Speech: Articulation and
Perception. (Washington: Clearinghouse for Federal Scientific and
Technical Information), JPRS, 30.

Lackner, J. R. and B. K. Levine. (1975) Speech production: Evidence for
syntactically and phonologically determined units. Percept. Psychophys.
17, 107-113.

Ladefoged, P. (1971) Preliminaries to Linguistic Phonetics. (Chicago:
University of Chicago Press).

Lehiste, I. (1970) Suprasegmentals. (Cambridge: MIT Press).

Lehman, W. (1975) Historical Linguistics, 2nd ed. (New York: Holt,
Rinehart & Winston).

Liberman, A. M. (in press) How abstract must a motor theory of speech
perception be? In Proceedings of the Eighth International Congress of
Phonetic Sciences, Leeds, England, 17-23 August, 1975. [Also in Haskins
Laboratories Status Report on Speech Research SR-44, 1-15, (1975).]

Liberman, A. M. (1976) Discussion paper. In Origins and Evolution of
Language and Speech, ed. by S. R. Harnad, H. D. Steklis, and J. Lancaster.
(New York: New York Academy of Sciences), 718-724. [Annals of the New
York Academy of Sciences 280, 718-724 (1976).]

Liberman, A. M. (1974) The specialization of the language hemisphere. In
The Neurosciences: Third Study Program, ed. by F. O. Schmitt and
F. G. Worden. (Cambridge, Mass.: MIT Press), pp. 43-56.

Liberman, A. M. (1970) The grammars of speech and language. Cog.
Psychol. 1, 301-323.

Liberman, A. M. and D. B. Pisoni. (in press) Evidence for a special
speech-perceiving subsystem in the human. In The Recognition of Complex
Acoustic Signals, ed. by T. H. Bullock. (Berlin: Dahlem Konferenzen).

Liberman, A. M., F. S. Cooper, D. P. Shankweiler, and M. Studdert-Kennedy.
(1967) The perception of the speech code. Psychol. Rev. 74, 431-461.

Liberman, A. M., P. C. Delattre, and F. S. Cooper. (1952) The role of
selected stimulus-variables in the perception of the unvoiced stop
consonants. Am. J. Psychol. 65, 497-516.

Liberman, A. M., F. Ingemann, L. Lisker, P. Delattre, and F. S. Cooper.
(1959) Minimal rules for synthesizing speech. J. Acoust. Soc. Am. 31,
1490-1499.

Liberman, A. M., I. G. Mattingly, and M. T. Turvey. (1972) Language codes
and memory codes. In Coding Processes in Human Memory, ed. by
A. W. Melton and E. Martin. (Washington, D.C.: V. H. Winston), pp.
307-334.

Lisker, L. (1957a) Some cues to the voiced-voiceless distinction among the
intervocalic stops in English. Haskins Laboratories 24th Quarterly
Progress Report, Appendix 5.

Lisker, L. (1957b) Closure duration and the intervocalic voiced-voiceless
distinction in English. Language 33, 42-49.

MacNeilage, P. F. (1970) Motor control of serial ordering of speech.
Psych. Rev. 77, 182-196.

Martin, J. G. (1975) Rhythmic expectancy in continuous speech perception.
In Structure and Process in Speech Perception. (New York:
Springer-Verlag), pp. 161-177.

Martin, J. G. (1972) Rhythmic hierarchical versus serial structure in speech
and other behavior. Psychol. Rev. 79, 487-509.

Massaro, D. W. (1972) Preperceptual images, processing time and perceptual
units in auditory perception. Psych. Rev. 79, 124-145.

Massaro, D. W. and M. M. Cohen. (1976) The contribution of fundamental
frequency and voice onset time to the /zi/-/si/ distinction.
J. Acoust. Soc. Am. 60, 704-717.

Mattingly, I. G. (1976) Syllable synthesis. J. Acoust. Soc. Am. 60, S75(A).
[Also in Haskins Laboratories Status Report on Speech Research SR-49
(1977).]

Mattingly, I. G. (1974) Speech synthesis for phonetic and phonological
models. In Current Trends in Linguistics, Vol. 12: Linguistics and
Adjacent Arts and Sciences, ed. by T. A. Sebeok, et al. (The Hague:
Mouton), 2451-2488.

Mattingly, I. G. (1972) Speech cues and sign stimuli. American Scientist
60, 327-337.

Mattingly, I. G. and A. M. Liberman. (1969) The speech code and the
physiology of language. In Information Processing in the Nervous System,
ed. by K. N. Leibovic. (New York: Springer-Verlag), pp. 97-117.

Mattingly, I. G., A. M. Liberman, G. K. Syrdal, and T. Halwes. (1971)
Discrimination in speech and nonspeech modes. Cog. Psychol. 2, 131-157.

McNeill, D. and L. Lindig. (1973) The perceptual reality of phonemes,
syllables, words and sentences. J. Verbal Learn. Verbal Behav. 12, 419-430.

Mermelstein, P. (1975) Automatic segmentation of speech into syllabic units.
J. Acoust. Soc. Am. 58, 880-883.

Miller, G. A. and P. Nicely. (1955) An analysis of some perceptual
confusions among some English consonants. J. Acoust. Soc. Am. 27, 338-352.

Miller, J. D. (in press) Perception of speech sounds by animals: Evidence
for speech processing by mammalian auditory mechanisms. In Recognition
of Complex Acoustic Signals, ed. by T. H. Bullock (Berlin: Dahlem
Konferenzen 1977).

Mitchell, P. D. (1973) A test of differentiation of phonemic feature
contrasts. Unpublished Ph.D. Dissertation, City University of New York.

Ochiai, K. and O. Fujimura. (1971) Vowel identification and phonetic
contexts. (Report of University of Electro-Communications, Chofu, Tokyo),
22-2, 103-111.

Ogden, C. K. (1967) Opposition. (Bloomington: Indiana University Press).

Peterson, G. E. and H. L. Barney. (1952) Control methods used in a study of
vowels. J. Acoust. Soc. Am. 24, 175-184.

Peterson, G. E., W. S-Y. Wang, and E. Sivertsen. (1958) Segmentation
techniques of speech synthesis. J. Acoust. Soc. Am. 30, 739-742.

Pickett, J. M. and L. R. Decker. (1960) Time factors in perception of a
double consonant. Lang. Speech 3, 11-17.

Pisoni, D. B. (in press) Speech Perception. In Handbook of Learning and
Cognitive Processes, ed. by W. Estes, vol. 6. (Hillsdale, N.J.:
Lawrence Erlbaum Associates).

Port, R. F. (1976) The influence of speaking tempo on the duration of
stressed vowel and medial stop in English trochee words. Unpublished
Ph.D. dissertation, University of Connecticut.

Postal, M. (1968) Aspects of Phonological Theory. (New York: Harper &
Row).

Rand, T. C. (1971) Vocal tract size normalization in the perception of stop
consonants. Haskins Laboratories Status Report on Speech Research
SR-25/26, 141-146.

Rand, T. C. (1974) Dichotic release from masking for speech.
J. Acoust. Soc. Am. 55, 678-680.

Raphael, L. J. (1972) Preceding vowel duration as a cue to the perception of
voicing characteristic of word-final consonants in American English.
J. Acoust. Soc. Am. 51, 1296-1303.

Raphael, L. J., M. F. Dorman, and A. M. Liberman. (1975) Perception of vowel
and syllable duration in VC and CVC syllables. J. Acoust. Soc. Am. 58,
S57(A).

Rubin, P., M. T. Turvey, and P. Van Gelder. (1976) Initial phonemes are
detected faster in spoken words than in spoken nonwords.
Percept. Psychophys. 19, 394-398.

Savin, H. B. and T. G. Bever. (1970) The non-perceptual reality of the
phoneme. J. Verbal Learn. Verbal Behav. 9, 295-302.

Schatz, C. D. (1954) The role of context in the perception of stops.
Language 30, 47-56.

Shankweiler, D. P. and M. Studdert-Kennedy. (1967) Identification of
consonants and vowels presented to left and right ears. Quart. J. Exp.
Psychol. 19, 59-63.

Shattuck, S. R. and D. H. Klatt. (1976) The perceptual similarity of mirror-
image acoustic patterns in speech. Percept. Psychophys. 20, 470-474.

Stevens, K. N. (1975) The potential role of property detectors in the
perception of consonants. In Auditory Analysis and Perception of Speech,
ed. by C. G. M. Fant and M. A. A. Tatham. (New York: Academic Press),
303-330.

Stevens, K. N. and A. S. House. (1972) The perception of speech. In
Foundations of Modern Auditory Theory, ed. by J. Tobias. (New York:
Academic Press), pp. 3-62.

Strange, W., R. Verbrugge, D. P. Shankweiler, and T. R. Edman. (1976)
Consonant environment specifies vowel identity. J. Acoust. Soc. Am. 60,
198-212.


Studdert-Kennedy, M. (1974) The perception of speech. In Current Trends in
Linguistics, ed. by T. A. Sebeok, vol. 12 (4). (The Hague: Mouton),
pp. 2349-2385.

Studdert-Kennedy, M. (1975a) From acoustic signal to phonetic message.
J. Communic. Dis. 8, 181-188.

Studdert-Kennedy, M. (1975b) From continuous signal to discrete message:
Syllable to phoneme. In The Role of Speech in Language, ed. by
J. F. Kavanagh and J. E. Cutting. (Cambridge: MIT Press), pp. 113-125.

Studdert-Kennedy, M. (1976) Speech perception. In Contemporary Issues in
Experimental Phonetics, ed. by N. J. Lass. (New York: Academic Press),
pp. 243-293.

Studdert-Kennedy, M. (in press) Universals in phonetic structure and their
role in linguistic communication. In The Recognition of Complex Acoustic
Signals, ed. by T. H. Bullock. (Berlin: Dahlem Konferenzen).

Studdert-Kennedy, M. and D. P. Shankweiler. (1970) Hemispheric specialization
for speech perception. J. Acoust. Soc. Am. 48, 579-594.

Summerfield, A. Q. and P. J. Bailey. (1977) On the dissociation of spectral
and temporal cues for stop consonant manner. J. Acoust. Soc. Am. 61,
S(A).

Summerfield, A. Q. and M. P. Haggard. (1977) On the dissociation of spectral
and temporal cues to the voicing distinction in initial stop consonants.
Haskins Laboratories Status Report on Speech Research SR-49, 1-36.

Svennson, S-G. (1974) Prosody and grammar in speech perception. MILUS 2.
(Institute of Linguistics, University of Stockholm).

Warren, R. M., C. J. Obusek, R. M. Farmer, and R. P. Warren. (1969) Auditory
sequence: Confusion of patterns other than speech and music. Science
164, 586-587.

Warren, R. M. (1976a) Auditory sequence and classification. In Contemporary
Issues in Experimental Phonetics, ed. by N. J. Lass. (New York:
Academic Press), pp. 389-417.

Warren, R. M. (1976b) Auditory perception and speech evolution. In Origins
and Evolution of Language and Speech, ed. by S. R. Harnad, H. D. Steklis,
and J. Lancaster. (New York: New York Academy of Sciences), 708-717.
[Annals of the New York Academy of Sciences 280, 708-717.]

Whitfield, I. C. (1965) 'Edges' in auditory information processing. In
Proceedings of XXIIIrd International Congress of Physiological Sciences,
(Tokyo), September, pp. 245-247.

Whitfield, I. C. and E. F. Evans. (1965) Responses of auditory cortical
neurons to stimuli of changing frequency. J. Neurophysiol. 28, 655-672.

Wickelgren,  W.  A.  (1969)  Context-sensitive  coding,  associative  memory  and 
serial  order  in  (speech)  behavior.  Psychol.  Rev.  76,  1-15. 




Cardiac Indices of Infant Speech Perception: Orienting and Burst
Discrimination*

Cynthia L. Miller,† Philip A. Morse† and Michael F. Dorman††


ABSTRACT 


The present study investigated burst cue discrimination in 3- to
4-month-old infants with the natural speech stimuli [bu] and [gu].
The experimental stimuli consisted of either a [bu] or a [gu] burst
attached to the formants of the [bu], such that the sole difference
between the two stimuli was the initial burst cue. Infants were
tested using a cardiac orienting response (OR) paradigm that
consisted of 20 tokens of one stimulus (for example, [bu]) followed by
20 tokens of the second syllable (20/20 paradigm). An OR to the
stimulus change revealed that young infants can discriminate burst
cue differences in speech stimuli. Discussion of the results
focused on asymmetries observed in the data and the relationship of
these findings to our previous failure to demonstrate burst
discrimination using the habituation/dishabituation cardiac measure
generally employed with older infants.
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of  Wisconsin-Madison.  This  paper  is  in  press  in  the  Quarterly  Journal  of 
Experimental Psychology.
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INTRODUCTION 


The perception of speech involves the integration of overlapping acoustic
cues in the form of a complex code, the primary unit of which has been defined
as the syllable (Liberman, Cooper, Shankweiler and Studdert-Kennedy, 1967;
Liberman, 1970). In syllables consisting of a stop consonant plus a vowel, a
brief, initial burst of energy, and (in some stops) aspiration, typically
precede the formant transitions and steady-state vowel components of the
syllable. Although the importance of all of these components in adult speech
perception has been investigated, perhaps the least studied and most
controversial of these components has been the initial burst cue.

The burst cue consists of (i) a brief explosion (less than 20 msec)
produced by the release of occlusion, and (ii) a very brief (0-10 msec) period
of frication. The duration and frequency characteristics of the burst vary as
a function of place of articulation, vowel context and voicing. The
perceptual significance of the burst also varies as a function of these
factors. For example, bursts contribute little to the identification of /b/
except, possibly, in back vowel environments; bursts contribute significantly
to the identification of /d/ in front vowel environments, but relatively less
in center and back vowel environments; and bursts are generally important for
identification of /g/ (for details see Fischer-Jørgensen, 1972; Dorman,
Studdert-Kennedy and Raphael, in press). Bursts, then, are important cues for
stop consonant recognition.


In a recent study by Morse, Leavitt, Miller and Romero (in press), adult
burst discrimination was investigated using a nonverbal, cardiac measure. In
this study a heart-rate (HR) orienting response (OR)
habituation-dishabituation procedure was used in assessing the discrimination
of a natural [bu] ([bu] burst + [bu] formants) and a transposed [gu],
consisting of a [gu] burst attached to the same [bu] formants. In the type of
paradigm employed by Morse et al., the subject is presented with repeated
trials of a familiarization stimulus followed by 1 or 2 trials of a change
stimulus. To allow for recovery of a cardiac response to trial offset, the
intertrial intervals (ITI) in this paradigm typically vary between 25 and 60
sec. In the Morse et al. study, subjects received 8 trials [each consisting of
8 stimuli with an interstimulus interval (ISI) of 1 sec] of the
familiarization stimulus (either [bu] or [gu]) followed by 2 trials of the
change stimulus. Intertrial intervals (offset to onset) varied randomly
between 25 and 35 sec. In this study, habituation of the cardiac component of
the orienting response (HR deceleration) to trial onset was observed across
the familiarization trials. Dishabituation (recovery of the orienting
response) was found to occur in response to the onset of the change trials,
thereby indicating discrimination of those burst cues. In addition, this
cardiac evidence of discrimination was accompanied by verbal reports of
discrimination.


If bursts do play such an important role in adult speech perception, then
it is of interest to examine the developmental course of their importance
beginning in early infancy. Research on infant speech perception has shown
that by four months of age, infants can already discriminate formant
transition and steady-state vowel information in a manner similar to the
adult's perception of these cues (for example, Eimas, Siqueland, Jusczyk and
Vigorito, 1971; Eimas, 1974, 1975a; Miller, 1974; Miller and Morse, 1976;
Swoboda, Morse and Leavitt, 1976; Till, 1976). In contrast, no comparable
data are yet available on the infant's ability to discriminate differences in
the burst component of the syllable. Since variations of the cardiac
habituation-dishabituation paradigm described above have been successfully
employed in several studies of infant auditory discrimination (Moffitt, 1971;
Berg, 1972; Lasky, Syrdal-Lasky and Klein, 1975), an earlier study (Miller,
Morse and Dorman, 1975) attempted to investigate infant burst cue
discrimination using the same heart rate procedure and stimuli employed in the
Morse et al. (in press) adult study. The 3- to 4-month-old infants in this
study (hereafter referred to as Miller et al., 1975a) were presented with 8
trials of either [bu] or [gu] and 2 trials of the change stimulus. The cardiac
data from this study are depicted in Figure 1. In this figure, the 10 trials
of the 8/2 procedure are grouped into 5 trial blocks (TB) of 2 trials each.
Analyses of variance for trends over seconds and trends over trials performed
on these data confirmed the observation of a reliable orienting response (OR)
to trial onset on the first few trials (TB 1-3) that subsequently habituated
over trials. However, dishabituation on the change trials (TB 5) was not
observed, thus suggesting that these infants were not capable of
discriminating this burst contrast (for additional details of the methodology
and results of this study, cf. Miller et al., 1975). However, recent
developmental studies of infant auditory discrimination suggest that this
conclusion may be premature. Although infants older than 4 months of age have
been found to readily exhibit auditory discrimination with variations of the
8/2 cardiac OR paradigm (Moffitt, 1971; Berg, 1974; Lasky et al., 1975),
infants between 6 weeks and 12 weeks have failed to demonstrate auditory
discrimination when these paradigms were employed (Berg, 1974; Brown, Leavitt,
and Graham, 1975; in press; Leavitt, Brown, Morse and Graham, in press). For
example, Brown et al. (1975) recently employed a 6/2 cardiac paradigm (6
familiarization trials, 2 novel trials) in assessing the discrimination of an
auditory change in 12-week-old infants. As in the Miller et al. (1975a) study,
Brown et al. (1975) observed significant orienting and habituation, but no
dishabituation to a stimulus change.

Figure 1: HR difference score data from the 8/2 procedure. Ten trials are
grouped into 5 trial blocks of 2 trials each (Miller et al., 1975a).
[Abscissa: poststimulus seconds.]

In contrast to the failure of infants less than 4 months of age to
evidence auditory discrimination using an habituation-dishabituation cardiac
paradigm, several investigators have reported auditory/speech discrimination
in infants as young as 4 weeks of age using an operant high-amplitude sucking
paradigm (Eimas, Siqueland, Jusczyk and Vigorito, 1971; Trehub and
Rabinovitch, 1972). Consequently, the absence of discrimination in the present
experiment (and in the Brown et al., 1975, study) may more likely reflect a
developmental limitation of the habituation-dishabituation paradigm, rather
than an inability of young infants to discriminate burst cues. A recent study
by Leavitt et al. (in press) further supports this suspicion. Leavitt et
al. failed to obtain auditory discrimination in 6-week-olds using a 6/2
paradigm, but when a cardiac paradigm was employed in which the intertrial
intervals of the habituation/dishabituation procedure were eliminated (a
no-ITI paradigm), 6-week-olds did exhibit auditory discrimination.

In sum, this developmental pattern of heart rate results suggests that
the 3- to 4-month-old infants' competence in discriminating burst cues may be
better assessed with a no-ITI cardiac paradigm than with an
habituation-dishabituation cardiac procedure. Therefore, the present
experiment employed a no-ITI paradigm similar to that used by Leavitt et
al. to investigate further the burst discrimination of 3- to 4-month-old
infants. In this study 20 tokens of one syllable (for example, [bu]) were
followed immediately by 20 tokens of a change syllable ([gu]) and
discrimination was indexed by an OR to the stimulus shift.
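
The token order of a single 20/20 trial can be written out explicitly. The following is only an illustrative sketch in Python, not part of the original procedure: the function name and the string labels "bu" and "gu" are hypothetical, and the sketch represents the order of tokens only, with no timing information.

```python
# Illustrative sketch of one no-ITI (20/20) trial: 20 tokens of one
# syllable followed immediately by 20 tokens of the change syllable.
# The function name and labels are hypothetical, chosen for illustration.

def build_20_20_trial(first="bu", second="gu", n_per_block=20):
    """Return the ordered list of tokens for one trial."""
    return [first] * n_per_block + [second] * n_per_block

trial = build_20_20_trial()

# 1-based position of the first change token, i.e. the stimulus shift
# to which an orienting response would index discrimination.
change_position = trial.index("gu") + 1

print(len(trial), change_position)  # prints: 40 21
```

Under this representation the stimulus shift falls at the twenty-first token, with no interval separating the two blocks.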

METHOD 


Subjects 


Twelve infants, aged 3-4 months (mean = 3 mos, 1.5 wks), served as
subjects. The participation of parents in the greater Madison area was
solicited by a letter describing the research and a follow-up phone call. A
total of 29 infants was tested, with 14 (48 percent) eliminated on the basis of
predetermined state criteria, 2 because of equipment problems, and 1 due to
experimenter error.¹ The 12 remaining infants included 5 males and 7 females.
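
The subject counts reported above can be checked with a few lines of arithmetic. This is only a bookkeeping sketch; the variable names are illustrative and are not taken from the study's records.

```python
# Subject bookkeeping as reported in the Subjects section.
tested = 29
rejected_state = 14         # eliminated on predetermined state criteria
rejected_equipment = 2      # equipment problems
rejected_experimenter = 1   # experimenter error

retained = tested - rejected_state - rejected_equipment - rejected_experimenter
state_rejection_pct = round(100 * rejected_state / tested)

print(retained)             # 12
print(state_rejection_pct)  # 48
```

The retained count agrees with the 5 males and 7 females reported.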

Apparatus 

Each  subject  was  tested  in  an  infant  seat  positioned  on  a table-like 
platform  in  an  Audio-Suttle  sound-attenuated  chamber.  Throughout  the  session, 
the  parents  and  experimenter  were  able  to  monitor  visually  the  infant's 
behavior  over  a closed-circuit  television  system.  Stimuli  were  played  to  the 
subject  on  a TEAC  3300S  2-track  tape  deck  coupled  to  a Bogen  Challenger 
amplifier and Hewlett-Packard attenuator. An Acoustic Research 2ax speaker,
located 40" in front of the infant, presented the stimuli at 70 ± 1 dB(A) SPL
against a background level of 27 dB(A). Sound level measurements were made
with a General Radio Sound Level Meter (#1551-C, microphone #1560-P5) placed
at the site of the infant's head.

A stimulus  artifact  on  the  second  channel  of  the  stimulus  tape,  denoting 
stimulus  onset  and  change,  occurred  coincident  with  the  first  and  twenty-first 
stimuli  of  each  20/20  trial.  A Scientific  Prototype  audio  threshold  relay 
detected  the  stimulus  artifact  and  converted  it  into  a suitable  pulse  for 
recording  on  one  channel  of  a Sony  TC  756  2-track  tape  deck.  Cardiac  activity 
was  detected  by  Beckman  biopotential  miniature  skin  electrodes  and  amplified 
by  a Gilson  polygraph.  The  two  active  electrodes  were  placed  2-3  cm  above  the 
right  nipple  and  approximately  2-3  cm  above  and  to  the  left  of  the  navel.  A 
ground  electrode  was  placed  2-3  cm  above  the  left  nipple.  Sites  for  electrode 
placement  were  prepared  with  alcohol  and  the  electrodes  were  attached  with 
either Beckman or Beck-Lee paste and micropore tape. An adjustable pulser
converted each R wave in the electrocardiogram (EKG) into a square pulse
suitable for recording on the second channel of the Sony TC 756 tape deck.


¹Of the 14 infants rejected for state, 10 had at least one behaviorally
acceptable trial. Thus, only 4 infants (15 percent) did not contribute
acceptable data for the first trial, suggesting that because of the low
attrition rate this paradigm is a desirable one for infant researchers.
Stimuli


The natural speech stimuli [bu] and [gu], produced by an adult male
speaker, were used in constructing the experimental [bu] and [gu] stimuli
shown in Figure 2. The experimental [bu] was the natural [bu] initially
produced by the speaker. The experimental [gu] was produced by removing the
burst portion from the natural [bu] and replacing it with the burst portion of
the natural [gu]. This experimental stimulus was consistently identified as a
[gu] by over 50 adult Ss. The duration of the [bu] was 450 msec with an 8
msec burst, whereas the 474 msec [gu] contained a 32 msec burst. The
construction and recording of the stimuli were carried out on the PCM system
at Haskins Laboratories (Cooper and Mattingly, 1969). These were the identical
stimuli employed in the Miller et al. (1975) study.

Procedure 


Upon arrival at the laboratory, the parents were briefed on the procedures
and purposes of the research and their consent was solicited prior to
the test session. The infant was then placed in the infant seat in the
testing chamber and the electrodes were affixed to the infant's chest. When
the infant was judged to be in a quiet, alert state, stimulus presentations
began. The general parameters of the 20/20 paradigm and the orders presented
to subjects are depicted in Table 1 together with the contrasting features of
the 8/2 procedure employed in the Miller et al. (1975) study. As can be seen
in Table 1, each subject was presented with four 20/20 sequences (ISI = 1 sec),
each separated by a 30-second pause. The order of stimulus change within the
four trials was alternated from [bu] → [gu] to [gu] → [bu] (or vice versa) and
the order of presentation on the first trial was counterbalanced across
subjects, such that half of the subjects received a [bu] → [gu] shift on trial
1 (Group A) and half a [gu] → [bu] shift (Group B). The duration of each
20/20 sequence was 1 minute and that of the entire experimental session
approximately 5.5 minutes.
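
The reported durations follow from the stimulus parameters if the 1-sec ISI is taken as the silent gap from the offset of one token to the onset of the next. That reading is an assumption, since the text does not define the ISI. A minimal sketch of the arithmetic, with illustrative variable names:

```python
# Timing of one 20/20 sequence and of the whole session.
# Assumption: ISI is measured offset-to-onset, so each token occupies
# its own duration plus one 1-s gap.
bu_dur_s, gu_dur_s = 0.450, 0.474   # stimulus durations from the Stimuli section
isi_s = 1.0
tokens_per_half = 20

sequence_s = tokens_per_half * (bu_dur_s + isi_s) + tokens_per_half * (gu_dur_s + isi_s)

n_sequences = 4
pause_s = 30.0                       # pause between successive sequences
nominal_sequence_s = 60.0            # the "1 minute" figure given in the text
session_min = (n_sequences * nominal_sequence_s + (n_sequences - 1) * pause_s) / 60.0

print(round(sequence_s, 2))  # about 58.5 s, i.e. roughly 1 minute
print(session_min)           # 5.5
```

Four one-minute sequences plus three 30-second pauses give the 5.5-minute session length stated above.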

Throughout  the  session,  an  assistant  seated  inside  the  chamber  and  out  of 
the  infant's  sight  observed  and  recorded  the  infant's  behavior  using  a closed- 
circuit  TV  monitor.  Behavior  recording  occurred  for  5 seconds  prior  to  and  10 
seconds  following  each  trial  onset  and  change  and  included  visual  behavior 
(for  example,  fixation,  eye  widening),  body  movements,  vocalization,  sucking 
behavior,  and  states  of  arousal  (for  example,  fussy,  drowsy,  alert).  Subjects 
were  eliminated  from  the  study  only  if  they  exhibited  excessive  fussiness, 
drowsiness, and/or large movements during the behavior recording periods
(cf.  Leavitt,  1975,  for  further  details  of  recording  and  acceptance  criteria). 

Data Reduction

Each R wave of the infant's EKG was recorded as a square pulse on audio
tape and the R-R (interbeat) intervals for 5 prestimulus and 15 poststimulus
seconds were computed by a PDP-12 computer. These data were then converted by
a Datacraft computer into a beats-per-minute (bpm) measure for each pre- and
poststimulus second. Prestimulus level was calculated for one second prior to
each trial onset and change. Analyses were performed on difference scores
calculated by subtracting this prestimulus level from each of the subsequent
15 poststimulus seconds. F-tests were performed by a Datacraft computer on
the mean difference scores to determine significant departures from
prestimulus level for each poststimulus second, and trend analyses were
carried out by a UNIVAC 1110 computer.²
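
The reduction from R-R intervals to second-by-second difference scores can be sketched as follows. This is not the CARDIVAR or Datacraft code, which is not reproduced in the report; the time-weighted averaging of rate within each one-second window is an assumption made for illustration, and the function names are hypothetical.

```python
def bpm_per_second(r_times, start, stop):
    """Time-weighted mean heart rate (bpm) for each 1-s window in [start, stop).

    r_times: sorted R-wave times in seconds, with 0 at stimulus onset or change.
    Assumption: each interbeat interval contributes its rate (60 / interval)
    in proportion to the time it overlaps the window.
    """
    rates = []
    for s in range(start, stop):
        weighted, covered = 0.0, 0.0
        for t0, t1 in zip(r_times, r_times[1:]):
            overlap = min(t1, s + 1) - max(t0, s)
            if overlap > 0:
                weighted += overlap * 60.0 / (t1 - t0)
                covered += overlap
        rates.append(weighted / covered if covered else float("nan"))
    return rates


def difference_scores(r_times, n_post=15):
    """Subtract the rate in the one prestimulus second from each poststimulus second."""
    pre = bpm_per_second(r_times, -1, 0)[0]
    post = bpm_per_second(r_times, 0, n_post)
    return [bpm - pre for bpm in post]


# Illustrative data only: a steady 120 bpm before onset, slowing to 100 bpm after it.
r_waves = [-5 + 0.5 * i for i in range(11)] + [0.6 * k for k in range(1, 28)]
scores = difference_scores(r_waves)
print([round(x, 1) for x in scores[:3]])
```

With these made-up R-wave times every poststimulus second falls about 20 bpm below the prestimulus level, which is the kind of deceleration the F-tests on difference scores are meant to detect.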


RESULTS 


Onset  Data 


The onset data of the 20/20 procedure are displayed for the two orders of
presentation on trial 1 (as detailed in Table 1) in Figure 3. These data were
subjected to an analysis of variance for trends with Order of Presentation on
trial 1 (A = [bu] → [gu] vs. B = [gu] → [bu]) as a between-subjects factor and
within-subjects factors of Shift Condition within the 4 trials ([bu] → [gu] vs.
[gu] → [bu]), Trial Blocks (TB 1 = trials 1 and 2 vs. TB 2 = trials 3 and 4),
and Seconds (15).

A reliable orienting response to onset over all trials was shown in the
significant quadratic trend over Seconds, F(1,10) = 42.46, p < .001.
Furthermore, a significant main effect for Trial Blocks, F(1,10) = 6.6,
p < .01, suggested that the initial OR habituated from the first to the second
half of the session. Although no significant main effect for Order of
Presentation was obtained, a significant quadratic trend over Seconds x Order
x Condition interaction, F(1,10) = 23.07, p < .001, was observed. In addition,
the quadratic trend over Seconds x Condition x Trial Blocks interaction,
F(1,10) = 7.75, p < .025, was also found to be reliable. As can be seen in
Figure 3, these interactions indicate that the magnitude of the initial
orienting response varied in the two orders of initial presentation as a
function of stimulus shift and the first vs. the second half of the session.

Change  Data 


The change data are separated for the two orders, A and B, in Figure 4.
F-tests on the difference scores of these data revealed that orienting
occurred on every trial in which there was a [bu] → [gu] shift (trial 1,
p < .01; trial 2, p < .05; trial 3, p < .01; trial 4, p < .05).³ In contrast,
no significant orienting (or acceleration) occurred on those trials in which
there was a [gu] → [bu] shift. A 4-way analysis of variance, identical to that
employed for the onset data, was performed on the change data and confirmed
the existence of differential responding to the stimulus change. A significant
main effect of Shift Condition, F(1,10) = 6.05, p < .05, a significant
Seconds x Shift Condition interaction, F(14,140) = 2.75, p < .005, and a
significant cubic trend over Seconds x Shift Condition interaction,
F(1,10) = 19.25, p < .005, all indicated that there were different responses
to the two types of stimulus change. The absence of any order effects
confirmed the consistency of the stimulus shift effects across both orders A
and B.⁴

²Wilson's (1974) CARDIVAR package was employed in preparing the R-R interval
data for subsequent analyses. The F-test and trend analysis programs were
developed and generously made available to us by Dr. F. K. Graham.

³All significance levels for these F-test analyses were converted to the
Bonferroni t (Myers, 1972).
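
The Bonferroni conversion applied to the per-trial F-tests can be illustrated as below. The report does not state how many comparisons made up each family, so the family size of four (one test per trial) and the unadjusted p value are assumptions chosen only to show the calculation; the function names are hypothetical.

```python
def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level that keeps the familywise error rate at alpha."""
    return alpha / n_tests


def bonferroni_adjust(p, n_tests):
    """Bonferroni-adjusted p value, capped at 1."""
    return min(1.0, p * n_tests)


# Assumed family of four tests, one per trial (not stated in the report).
n_tests = 4
print(bonferroni_alpha(0.05, n_tests))    # per-test criterion for a familywise .05
print(bonferroni_adjust(0.004, n_tests))  # an illustrative unadjusted p of .004
```

Under this assumption a single test must reach p < .0125 to be reported as significant at the familywise .05 level.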

DISCUSSION 

Asymmetry  in  Discrimination 

Two  possible  interpretations  that  may  be  offered  for  the  asymmetry 
observed  in  this  study  are  related  to  the  construction  of  the  experimental 
stimuli.  The  first  possibility  is  that  the  [gu]  stimulus  may  have  been  more 
salient  to  the  infant,  and  therefore  a greater  elicitor  of  orienting.  The 
[gu] burst, relative to the [bu] burst, is slightly longer in duration, has a
greater energy concentration within a more restricted frequency range, and is
acoustically incongruous with the [bu] formant transitions which follow. Any
or all of these features may have enhanced the saliency of the [gu] stimulus.⁵

A second possibility is that during the first 20 stimuli, the burst cue
is being adapted in a manner similar to that reported in adults for a variety
of other speech cues, including bursts (Eimas and Corbit, 1973; Ades, 1974;
Cooper, 1974; Blumstein and Stevens, 1975; Diehl, 1975; Ganong, 1975; Tartter
and Eimas, 1975; Morse, Kass and Turkienicz, 1976). In the present study, if
the [gu] burst is adapted, the remaining cues of the experimental [gu]
stimulus are the formant transitions of the [bu]. Consequently, if adaptation
occurs within the first 20 stimuli, the [gu] stimulus would become a [bu], and
hence the shift to [bu] might not be discriminable. In contrast, when the
[bu]  burst  adapts,  a [bu]  remains  and  would  be  discriminated  from  a shift  to 
[gu].  Although  adult  studies  of  speech  adaptation  have  not  attempted  to 
determine  whether  reliable  adaptation  effects  can  occur  within  only  20 
presentations  of  a stimulus,  this  possibility  cannot  be  ruled  out,  either  in 
the  adult  or  in  the  infant. 

Although the asymmetry observed in the present study was not anticipated,
asymmetries in infant speech discrimination have also been reported elsewhere
(Butterfield and Cairns, 1974). Since Eimas (1975b) has suggested that the
process of adaptation may be responsible for much of the evidence of infant
speech discrimination obtained with the nonnutritive sucking paradigm, more
direct tests of this adaptation account using infant paradigms with adult
listeners may greatly aid in elucidating the mechanisms responsible for these
asymmetries.

⁴Subsequently, a group of 12 infants was tested in a control condition (no
stimulus change) that included one 20/20 trial of either [bu] or [gu].
F-tests and trend analyses performed on these data revealed no significant
cardiac deceleration following the twenty-first stimulus. These data confirm
the conclusion that the OR recovery to the [bu] → [gu] shifts actually
reflected burst discrimination, rather than cyclic changes in cardiac
activity.

⁵Although inspection of Figure 3 suggests that greater onset orienting did
occur to the [gu] stimulus, there is no statistical support for this
observation. However, a similar study of infant burst discrimination by
Miller, Goy, Morse and Dorman (1975) did find statistical evidence of greater
initial orienting to [gi] and [bi], using burst stimuli constructed in a
manner similar to those of the present study.

Discrimination: 8/2 vs. 20/20 Paradigm

The  major  implication  of  the  present  experiment  is  that  infant  burst 
discrimination  is  dependent  upon  the  particular  paradigm  employed.  These 
results  revealed  that  infants  are  capable  of  this  discrimination  when  tested 
with  the  20/20  paradigm,  yet  do  not  appear  so  within  the  8/2  paradigm  (Miller 
et  al.,  1975a).  Since  no  study  has  reported  cardiac  dishabituation  to 
auditory  stimuli  in  infants  younger  than  4 months,  it  is  possible  that  the 
lack  of  evidenced  discrimination  in  the  Miller  et  al.  (1975a)  study  may 
reflect  an  inability  of  young  infants  to  demonstrate  OR  dishabituation  to  any 
change in auditory stimulation. This interpretation is consistent with the
data reported by Brown et al. (1975, in press) that suggest developmental
trends in the characteristic properties of orienting behavior (that is,
initial  OR,  habituation,  dishabituation)  to  auditory  stimuli.  However,  since 
Adkinson  and  Berg  (1974)  have  observed  cardiac  dishabituation  to  visual 
stimuli  in  newborns,  this  conclusion  remains  somewhat  tenuous. 

Perhaps  the  more  productive  way  of  interpreting  the  difference  between 
these  two  studies  would  be  to  examine  the  physical  parameters  of  the  two 
paradigms employed (cf. Table 1). Two obvious parametric differences resulting
in the different stimulus distributions of these two paradigms are: 1)
the number of familiar stimuli preceding the stimulus shift, and 2) the ITI's
separating  blocks  in  the  8/2  procedure,  which  are  absent  in  the  20/20 
paradigm.  Since  there  are  actually  fewer  tokens  of  the  familiar  stimulus 
presented  to  the  infant  prior  to  the  change  in  the  20/20  paradigm  (20,  as 
opposed  to  64  in  the  8/2  procedure),  any  differences  in  memory  for  the 
familiar  syllable  cannot  be  due  to  the  total  number  of  prechange  exemplars. 

If,  instead,  the  ITI's  of  the  8/2  paradigm  were  primarily  responsible  for 
the  different  results  obtained  in  these  two  experiments,  then  it  may  be 
because  these  lengthy  silent  intervals  in  some  manner  imposed  too  great  a 
burden  upon  the  infant's  processing  of  these  burst  stimuli.  In  other  words, 
the  distribution  over  time  of  the  stimuli  in  the  8/2  procedure  may  have 
resulted  in  less  consolidation  of  the  stimulus  being  stored  in  memory  and/or 
some  decay  during  the  ITI  of  the  "neuronal  model"  (Sokolov,  1963)  of  the  burst 
stimulus.  Consequently,  the  habituation  observed  in  the  8/2  paradigm  may  have 
reflected  the  development  of  a more  general  model  of  the  stimulus  presented, 
and  thus  the  absence  of  discrimination  in  this  paradigm  is  not  surprising. 
Unfortunately,  the  development  of  a parametric  research  program  to  answer 
these  questions  may  be  complicated  by  several  factors.  First,  the  general 
parameters of the habituation-dishabituation paradigm were originally
conceptualized to include recovery time between trials for an OR to stimulus offset.
In  addition,  some  recent  work  (Roth  and  Morse,  1975)  suggests  that  for  speech 
sounds,  an  infant's  orienting  response  to  initial  stimulus  onset  may  require 
some  20-30  seconds  for  recovery.  Thus,  one  cannot  simply  vary  the  ITI  between 
blocks  of  stimuli  in  moving  from  a 20/20  paradigm  to  an  8/2  procedure. 
Although the results of the present study do not resolve these questions about
the processes underlying cardiac measures of infant discrimination, they do
demonstrate  that  with  one  cardiac  paradigm,  burst  discrimination  does  occur  in 
early  infancy. 

In  conclusion,  the  results  of  the  present  experiment  have  several 
implications  for  our  understanding  of  the  development  of  infant  speech 
perception.  First,  they  reveal  that  young  infants  can  discriminate  very  brief 
burst  cues  in  stop  consonants,  thus  adding  to  the  list  of  important  acoustic 
events  in  the  speech  signal  to  which  infants  are  sensitive  at  a very  early 
age.  Second,  the  consistent  pattern  of  asymmetry  suggests  that  more  direct 
tests  of  adaptation  proposals  for  infant  (and  adult)  burst  discrimination  may 
greatly  enhance  our  understanding  of  the  mechanisms  that  underlie  infant 
speech  perception.  Third,  the  burst  discrimination  obtained  with  the  20/20 
procedure of the present study suggests (as Leavitt et al. observed) that a
no-ITI paradigm may be more useful in studying the speech discrimination of
infants younger than 4 months than the more traditional
habituation-dishabituation procedure. The recent evidence of categorical
discrimination for place of articulation using the 20/20 paradigm in 3- to
4-month-old infants (Miller and Morse, 1976) further underscores the
usefulness of this paradigm in studying infant speech discrimination.

REFERENCES 


Ades, A. (1974) A bilateral component in speech perception. J. Acoust. Soc.
Am. 610-616.

Adkinson, C. D. and Berg, W. K. (1974) Habituation and dishabituation of
cardiac orienting in newborns. Psychophysiol. 11, 219.

Berg, W. K. (1972) Habituation and dishabituation of cardiac responses in
4-month-old, alert infants. J. Exp. Child Psychol. 14, 92-107.

Berg, W. K. (1974) Cardiac orienting responses of 6- and 16-week-old
infants. J. Exp. Child Psychol. 17, 303-312.

Blumstein, S. E. and Stevens, K. N. (1975) Property detectors for bursts and
transitions in speech perception. J. Acoust. Soc. Am. 57 (S1), 52.

Brown, J. W., Leavitt, L. A. and Graham, F. K. (1975) Infant cardiac response
during a six-week transition period. Research Status Report 1, Infant
Development Laboratory (Madison: University of Wisconsin), 65-89.

Brown, J. W., Leavitt, L. A. and Graham, F. K. (in press) Response to
auditory stimuli in six- and nine-week-old infants. Develop. Psychobiol.

Butterfield, E. and Cairns, G. (1974) Discussion summary - Infant perception
research. In Language Perspectives - Acquisition, Retardation, and
Intervention, ed. by R. Schiefelbusch and L. Lloyd. (Baltimore, Md.:
University Park Press), pp. 75-102.

Cole, R. A. and Scott, B. (1974a) The phantom in the phoneme: Invariant
characteristics of stop consonants. Percept. Psychophys. 15, 101-107.

Cole, R. A. and Scott, B. (1974b) Toward a theory of speech perception.
Psychol. Rev. 81, 348-374.

Cooper, F. S. and Mattingly, I. G. (1969) Computer-controlled PCM system for
investigation of dichotic speech perception. Haskins Laboratories Status
Report on Speech Research SR-17/18, 17-21.

Cooper, W. E. (1974) Adaptation of phonetic feature analyzers for place of
articulation. J. Acoust. Soc. Am. 56, 617-627.

Diehl, R. L. (1975) The effect of selective adaptation on the identification
of speech sounds. Percept. Psychophys. 17, 48-52.

Dorman, M. F., Studdert-Kennedy, M. and Raphael, L. J. (in press)
Stop-consonant recognition: Release bursts and formant transitions as
functionally equivalent, context-dependent cues. Percept. Psychophys.

Eimas, P. D. (1974) Auditory and linguistic processing of cues for place of
articulation by infants. Percept. Psychophys. 16, 513-521.

Eimas, P. (1975a) Auditory and phonetic coding of the cues for speech:
Discrimination of the [r-l] distinction by young infants. Percept.
Psychophys. 18, 341-347.

Eimas, P. D. (1975b) Speech perception in early infancy. In Infant
Perception, ed. by L. B. Cohen and P. Salapatek. (New York: Academic Press).

Eimas, P. D. and Corbit, J. D. (1973) Selective adaptation of linguistic
feature detectors. Cog. Psychol. 4, 99-109.

Eimas, P. D., Siqueland, E. R., Jusczyk, P. and Vigorito, J. (1971) Speech
perception in infants. Science 171, 303-306.

Fischer-Jørgensen, E. (1972) Perceptual studies of Danish stop consonants.
Annual Report VI. (Copenhagen: Institute of Phonetics, University of
Copenhagen).

Ganong, W. F. (1975) Phonetic adaptation measured on two continua between
the same syllables. J. Acoust. Soc. Am. 58 (S1), 58.

Lasky, R. E., Syrdal-Lasky, A. and Klein, R. F. (1975) VOT discrimination by
four to six and a half month old infants from Spanish environments. J. Exp.
Child Psychol. 20, 215-225.

Leavitt, L. A. (1975) State rating scales used in the Infant Development
Laboratory. Research Status Report, Infant Development Laboratory. (Madison:
University of Wisconsin), 341-345.

Leavitt, L. A., Brown, J. W., Morse, P. A. and Graham, F. K. (in press)
Cardiac orienting and auditory discrimination in six-week infants.
Developmental Psychology.

Liberman, A. M. (1970) The grammars of speech and language. Cog. Psychol. 1,
301-323.

Liberman, A. M., Cooper, F. S., Shankweiler, D. and Studdert-Kennedy, M.
(1967) Perception of the speech code. Psychol. Rev. 74, 431-461.

Liberman, A. M., Delattre, P. C. and Cooper, F. S. (1952) The role of
selected stimulus variables in the perception of unvoiced stop consonants.
Am. J. Psychol. 65, 497-516.

Miller, C. L., Goy, E. R., Morse, P. A. and Dorman, M. F. (1975) Selected
problems in infant burst discrimination. Research Status Report 1, Infant
Development Laboratory. (Madison: University of Wisconsin), 171-181.

Miller, C. L. and Morse, P. A. (1976) The "heart" of categorical speech
discrimination in young infants. J. Sp. Hear. Res. 19, 578-589.

Miller, C. L., Morse, P. A. and Dorman, M. F. (1975) Memory for burst cues
in infant burst discrimination. Research Status Report 1, Infant Development
Laboratory. (Madison: University of Wisconsin), 149-169.

Miller, J. L. (1974) Phonetic determination of infant speech perception.
Unpublished doctoral dissertation, University of Minnesota, Minneapolis,
Minnesota.

Moffitt, A. R. (1971) Consonant cue perception by
twenty-to-twenty-four-week-old infants. Child Dev. 42, 717-731.

Morse, P. A., Kass, J. E. and Turkienicz, R. (1976) Selective adaptation of
vowels. Percept. Psychophys. 19, 137-143.

Morse, P. A., Leavitt, L. A., Miller, C. L. and Romero, R. C. (in press)
Overt and covert aspects of adult speech perception. J. Sp. Hear. Res.

Myers, J. A. (1972) Fundamentals of Experimental Design. (Boston: Allyn &
Bacon, Inc.).

Roth, P. L. and Morse, P. A. (1975) An investigation of infant VOT
discrimination using the cardiac OR. Research Status Report 1, Infant
Development Laboratory. (Madison: University of Wisconsin), 207-218.

Sokolov, E. N. (1963) Perception and the Conditioned Reflex. (New York:
Macmillan).

Swoboda, P. J., Morse, P. A. and Leavitt, L. A. (1976) Continuous vowel
discrimination in normal and at-risk infants. Child Develop. 47, 459-465.

Tartter, V. S. and Eimas, P. D. (1975) The role of auditory feature
detectors in perception of speech. Percept. Psychophys. 18, 293-298.

Till, J. (1976) Infant's discrimination of speech and nonspeech stimuli.
Unpublished doctoral dissertation, University of Iowa, Iowa City, Iowa.

Trehub, S. E. and Rabinovitch, M. S. (1972) Auditory-linguistic sensitivity
in early infancy. Develop. Psychol. 6, 74-77.

Wilson, R. A. (1974) CARDIVAR: The statistical analysis of heart rate data.
Psychophysiol. 11, 76-85.




Cinefluorographic and Electromyographic Studies of Articulatory Organization*
Thomas Gay


ABSTRACT

Articulatory studies dealing with the spreading effect of
coarticulation and the control of speech rate are reviewed. The
results of some more recent research are then reviewed with a view
toward refining earlier formulations. The suggestion is made, in
contrast to traditional views, that the segmental input to the
speech string is governed by simple rules that operate within a
limited coarticulatory field, while the temporal organization of the
string requires complex articulatory adjustments based on advanced
information obtained from a higher level scan-ahead mechanism.

INTRODUCTION 


This  paper  summarizes  the  results  of  several  experiments  that  used  the 
techniques  of  cinefluorography  and  electromyography  to  study  the  organization 
of  speech  gestures.  As  such,  it  does  not  represent  a comprehensive  review  of 
current  speech  production  theory,  but  rather  is  directed  towards  a discussion 
of several specific issues that are best studied by these techniques: the
dynamics of articulatory movements and the motor command structure that
underlies those movements.

Although speech is usually described in terms of a string of invariant
segmental units (phonemes), the act of speaking imposes on this string a
complex encoding. This is a consequence of the series of events that comprise
the speech production chain: the conversions of motor command-to-muscle
contraction, muscle contraction-to-vocal tract shape, and vocal tract shape-to-
acoustic signal. The result of this encoding is observed as variation in both
the production of a given phone and in its acoustic representation. This
paper will be concerned with allophonic variation as it appears at the
articulatory level, specifically, variations that arise from changes in
phonetic context, and variations that arise from changes in the suprasegmental
structure of the string, particularly changes in speech rate.

*This paper was presented at the U.S.-Japan Joint Seminar on Dynamic Aspects
of Speech Production, Hill-Top Hotel, Tokyo, Japan, 7-10 December, 1976.


THE  ORGANIZATION  OF  SEGMENTAL  GESTURES 


Coarticulation  is  usually  defined  as  allophonic  variation  of  a given 
phone  due  to  changes  in  its  phonetic  environment.  The  production  of  a phone 
can be conditioned by a phone that either precedes it (left-to-right or
carryover  effects)  or  follows  it  (right-to-left  or  anticipatory  effects). 

Anticipatory  coarticulation  effects  are  essentially  timing  effects: 
movements  toward  some  parts  of  a feature  target  of  a given  segment  begin 
before  others.  Kozhevnikov  and  Chistovich  (1965)  studied  the  anticipation  of 
lip  rounding  that  occurs  when  a rounded  vowel  follows  a consonant  and 
suggested  that  the  forward  extent  of  this  anticipatory  gesture  was  the  first 
consonant  in  the  sequence.  Daniloff  and  Moll  (1968)  also  showed  that  lip 
rounding  can  begin  ahead  of  the  syllable  boundary  and  across  as  many  as  four 
consonants  preceding  the  vowel.  In  their  experiment,  anticipation  of  lip 
rounding  for  the  vowel  /u/  was  studied  for  a number  of  mono-  and  disyllabic 
single  and  two-word  utterances  imbedded  in  sentence  frames,  using  lateral  view 
x-ray  motion  pictures.  Onset  of  lip  rounding  usually  begins  with  the  first 
consonant  of  the  utterance.  Another  type  of  anticipatory  coarticulation  was 
shown  to  exist  by  Ohman  (1966).  In  a spectrographic  study  of  coarticulation 
in VCV sequences, Ohman concluded that the variability observed in transitional
movements to the consonant could be predicted by the formant frequencies of
the  second  vowel.  This  led  Ohman  to  conclude  further  that  vowel-to-vowel 
movement  in  a VCV  is  essentially  diphthongal  with  the  consonant  simply 
superimposed  on  the  basic  gesture;  in  other  words,  anticipatory  movements 
toward  the  second  vowel  begin  independently  from  those  toward  the  consonant. 
These  studies,  among  others,  suggest  that  articulatory  encoding  is  a complex 
phenomenon  whose  effects  can  spread  across  several  adjacent  segments.  Most 
support,  either  explicitly  or  implicitly,  Henke's  (1966)  articulatory  model 
that  proposes  the  operation  of  a mechanism  that  scans  future  segmental  inputs, 
or  features  thereof,  and  sends  commands  for  the  immediate  attainment  of  those 
feature  targets  that  would  not  interfere  with  the  attainment  of  immediately 
intervening articulations. However, in two recent studies, both
cinefluorographic (see Gay, in press) and electromyographic (see Gay, 1974b),
evidence was used to argue against the pervasiveness of anticipatory
coarticulation in speech.

In the cinefluorographic experiment, conventional high speed (60 fps)
lateral view x-ray films were recorded from two subjects who produced various
VCV syllables that contained the vowels /i,a,u/ and the consonants /p,t,k/ in
possible combinations. Articulatory movements were tracked by recording
positions, frame-by-frame, of 2.5 mm diameter lead pellets that were
attached to the upper and lower lips, jaw, and several locations along the
[...] of the tongue relative to a reference pellet attached at the embrasure
of the upper central incisors. These data will be used to explore the
question of whether, in a VCV utterance, an intervening consonant constrains
[...] the articulators, in particular the tongue body and lips, from
[...]; is the movement from one vowel to another essentially
[...] it somehow locked to the consonant (Ohman's hypothesis), and
[...] gesture for the postvocalic rounded vowel begin ahead of
the intervocalic consonant? (Henke's model).

The dynamic properties of articulatory movements in a VCV sequence are
illustrated in Figure 1 for an utterance where the intervocalic consonant is
/p/. This figure shows the movement tracks of the tongue body, lips, and jaw
in the height dimension for the sequence /ipa/. Each track was graphed from
discrete points measured every film frame, that is, at approximately 17 msec
intervals. Measurements begin during the closure period of the initial /k/
and end at the time of closure for the final /p/; 0 on the abscissa
corresponds  to  the  time  of  intervocalic  consonant  closure.  This  figure 
illustrates  the  general  finding  that  the  intervocalic  consonant  affects  the 
timing  of  the  movements  of  the  tongue  body  from  vowel  to  vowel.  The  movement 
of  the  tongue  body  from  the  first  vowel  to  the  second  vowel  does  not  begin 
until  after  closure  for  the  intervocalic  consonant  is  completed.  This  was 
found  to  be  a salient  feature  in  the  production  of  all  VCV  utterances. 
Consonant  constraints  on  vowel-to-vowel  movements  were  as  evident  in  the 
front-back  dimension  as  in  the  height  dimension,  and  the  rules  that  apply  to 
/p/ also apply when the intervocalic consonant is either /t/ or /k/. The only
variability  in  the  timing  effect  appears  in  the  delay  time  between  consonant 
closure  and  tongue  body  movement.  While  the  lag  was  usually  of  the  order  of 
30  msec,  it  varied  anywhere  from  10-60  msec.  This  figure  also  shows  that  the 
movements  of  the  tongue  body,  because  they  begin  ahead  of  those  for  the  jaw, 
are  probably  independent  from  jaw  movements  towards  the  vowel.  As  is  also 
evident  in  this  figure,  upper  lip  contributions  to  lip  closure  were 
negligible.  Finally,  this  subject  showed  a pattern  of  lip  closure  that  was 
often  characterized  by  continued  compression  throughout  the  closure  period. 
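
The timing relations described above can be made concrete with a short sketch. The 60 fps film speed and the convention of placing 0 at consonant closure come from the text; the pellet track, the 0.5 mm onset threshold, and the function names are hypothetical and serve only as illustration.

```python
FPS = 60  # film speed stated in the text (frames per second)


def frame_times_ms(n_frames, closure_frame, fps=FPS):
    """Time of each frame in msec, with 0 at the frame of consonant closure."""
    step = 1000.0 / fps  # about 16.7 msec per frame at 60 fps
    return [(i - closure_frame) * step for i in range(n_frames)]


def movement_onset_ms(track, times, threshold_mm=0.5):
    """Time of the first frame whose position departs from the starting
    position by more than threshold_mm (an assumed onset criterion)."""
    start = track[0]
    for position, t in zip(track, times):
        if abs(position - start) > threshold_mm:
            return t
    return None  # no movement detected


# Hypothetical tongue body height track (mm), one value per film frame.
track = [10.0, 10.0, 10.1, 10.0, 10.0, 9.2, 8.1, 6.9]
times = frame_times_ms(len(track), closure_frame=3)
lag = movement_onset_ms(track, times)
```

With these invented values the onset falls two frames after closure, about 33 msec, which lies inside the 10-60 msec range reported above.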

Perhaps  the  best  illustration  of  consonantal  constraints  on  tongue  body 
movements  is  one  where  the  first  and  second  vowels  of  the  utterance  are  the 
same.  Figure  2 shows  the  movement  tracks  for  the  jaw  and  four  tongue  pellets 
during the production of /iti/ for Subject FSC. Instead of the tongue
maintaining the /i/ target during the consonant, the tongue blade and both
tongue  body  pellets  show  movement  throughout  the  consonant  gesture.  The  blade 
and  anterior  tongue  body  pellet  appear  to  shadow  movements  of  the  tip  while 
the  posterior  tongue  body  pellet  moves  in  the  opposite  direction  (lower), 
probably  in  a facilitory  gesture.  However,  since  the  tongue  body  gesture 
(when  it  does  appear)  does  not  reach  a specific,  repeatable  location,  it  is 
not  interpreted  as  being  target  directed. 

The  discontinuity  of  the  vowel  target  in  an  utterance  where  the  same  two 
vowels  are  separated  by  a consonant  is  also  evident  at  the  EMG  level  (see  Gay, 
1974b).  The  average  EMG  activity  of  the  genioglossus  muscle  for  the  sequences 
/ipi/ and /iti/ is illustrated in Figure 3. The genioglossus muscle, which
comprises the bulk of the tongue body, is primarily responsible for the
protruding and bunching associated with the vowel /i/. This figure shows
three separate peaks associated with the utterance. The first peak corresponds
to the initial /k/, while the second and third correspond to the first
and  second  vowels.  Of  particular  interest  is  the  deep  trough  that  separates 
the  two  vowel  peaks.  The  presence  of  a trough,  which  signifies  a cessation  of 
genioglossus  activity,  suggests  that  the  two  vowels,  although  phonetically 
identical,  are  organized  as  two  separate  events.  If  the  movement  of  the 
tongue  body  during  the  production  of  the  consonant  as  observed  in  the  x-ray 
data  (Figure  2)  was  due  solely  to  the  movement  of  other  articulatory 
structures, such as the tongue tip or jaw, positional constancy would exist at
the  EMG  level  in  the  form  of  one  broad  genioglossus  peak  across  the  entire 
utterance.  However,  the  existence  of  two  distinct  peaks  separated  by  a deep 
trough suggests that the intervocalic consonant has more than a passive effect
on  tongue  body  movement. 
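
One way to make the notion of a "deep trough" explicit is to compare the minimum between two peaks with the smaller of the two peaks. The following is a minimal sketch, assuming a smoothed, averaged EMG envelope sampled at equal intervals; the sample values and the depth measure itself are illustrative assumptions, not the analysis used in the study.

```python
def local_peaks(envelope):
    """Indices of samples higher than the previous sample and not lower
    than the next one."""
    return [i for i in range(1, len(envelope) - 1)
            if envelope[i - 1] < envelope[i] >= envelope[i + 1]]


def trough_depth(envelope, first_peak, second_peak):
    """Relative depth of the minimum between two peaks
    (0 = no dip, 1 = complete cessation of activity)."""
    trough = min(envelope[first_peak:second_peak + 1])
    smaller_peak = min(envelope[first_peak], envelope[second_peak])
    return 1.0 - trough / smaller_peak


# Hypothetical genioglossus envelope (arbitrary units) covering the two
# vowels of /iti/; the values are invented for illustration.
envelope = [5, 40, 90, 60, 20, 8, 25, 70, 95, 50, 10]
vowel_peaks = local_peaks(envelope)
depth = trough_depth(envelope, vowel_peaks[0], vowel_peaks[1])
```

For a smooth envelope, a single broad peak (the pattern expected if the tongue body were only passively displaced) would yield one index from `local_peaks` rather than two.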

For VCV utterances containing either /p/, /t/ or /k/ as the intervocalic
consonant, the usual sequence of articulatory events is as follows: movements
of  the  jaw,  tongue  body  and  primary  articulator  begin  at  about  the  same  time, 
with  jaw  closing  continuing  past  the  time  of  occlusion  for  the  consonant. 
Shortly  after  closure  for  the  consonant  occurs,  tongue  body  movement  toward 
the second vowel begins. This movement is followed, independently, by jaw
opening  and  release  of  the  consonant.  Articulatory  movements  for  the  post- 
vocalic vowel  always  begin  between  the  time  of  consonant  closure  and  consonant 
release.  Constraints  of  the  intervocalic  consonant  are  also  evident  at  the 
EMG  level  in  the  form  of  a separate  muscle  peak  for  each  syllable. 

The  data  of  both  the  cinefluorographic  and  electromyographic  experiments, 
in  showing  consonant  constraints  on  vowel  movement  in  a VCV  utterance,  argue 
against Ohman's (see Ohman, 1966) hypothesis that suggests vowel-to-vowel
movement  is  essentially  diphthongal.  If  Ohman's  model  were  correct,  tongue 
body  movements  toward  the  second  vowel  would  begin  at  about  the  time  of  onset 
of closing for the consonant. However, movement toward the second vowel
begins  much  later,  some  10-60  msec  after  closure  for  the  consonant  has  already 
been  completed.  This  suggests  that  either  the  tongue  body  itself  attains  a 
target  during  consonant  production,  or  more  likely,  that  the  intervocalic 
consonant  and  the  following  vowel  are  linked  in  a basic  gesture.  The  very 
short  lag  time  between  consonant  closure  and  movements  toward  the  second  vowel 
suggests the latter possibility.

In  addition  to  placing  constraints  on  the  movements  of  the  tongue  body 
from  one  vowel  to  another,  an  intervocalic  consonant  also  affects  the  onset  of 
lip  rounding  for  a rounded  second  vowel.  These  constraints  are  evident  at 
both  the  articulatory  and  EMG  levels.  Lateral  view  x-rays  can  provide  an 
indication  of  lip  rounding  in  the  form  of  degree  of  lip  protrusion.  In  those 
cases  where  a rounded  vowel  appears  in  a post-consonantal  position,  the 
rounding  gesture,  like  tongue  body  movements,  does  not  begin  until  after 
closure  for  the  intervocalic  consonant  is  completed.  This  is  true  even  for 
the  most  sensitive  case,  namely,  two  rounded  vowels  separated  by  a close 
consonant.  Figure  4 shows  the  movement  tracks  of  lower  lip  height,  lower  lip 
protrusion,  and  tongue  tip  height  plotted  against  the  same  baseline  for  the 
syllable  /utu/,  produced  by  Subject  GNS.  Even  in  this  example,  it  is  evident 
that  the  rounding  feature  of  the  first  vowel  is  not  continuous  through  the 
consonant.  Rather,  what  appears  to  be  an  additional,  although  small,  closing 
and  protruding  gesture  is  superimposed  on  the  rounding  pattern.  This  discon- 
tinuity of  rounding  during  the  consonant  is  also  evident  in  the  EMG  data 
(Figure 5) that shows a trough in orbicularis oris muscle activity during the
production  of  the  same  syllable.  Both  sets  of  data  argue  against  the  Daniloff 
and  Moll  (1968)  anticipatory  effect.  An  alternative  interpretation  of  Dani- 
loff and  Moll's  result  is  that  the  early  onset  of  lip  rounding  corresponded  to 
a closing  or  protruding  gesture  of  one  or  more  of  the  intervening  consonants 
in the utterance (for example, the /n/, /s/, /t/ or /r/ in the word
"construe")  or  some  special  property  of  the  cluster  itself.  This  explanation 
seems even more likely in light of Bell-Berti and Harris' (see Bell-Berti and
Harris, 1976) EMG data, which show that the beginning of orbicularis oris
muscle  activity  in  utterances  containing  the  syllables  /stru/  and  /stri/ 
occurs  at  the  same  time. 

To  summarize  the  data  thus  far:  the  relative  timing  of  articulatory 
movements  in  a VCV  sequence  seems  to  be  affected  by  the  intervocalic 
consonant,  even  if  the  gesture  for  the  consonant  is  not  a contradictory  one. 
The  intervocalic  consonant  shows  effects  on  tongue  body  movements  toward  and 
the  lip  rounding  gesture  for  the  second  vowel  at  both  the  articulatory  and  EMG 
levels.  Anticipatory  movements  toward  the  second  vowel  begin  during  the 
closure  period  of  the  intervocalic  consonant,  suggesting  that  the  CV  component 
of  the  VCV  sequence  is  produced  as  a basic  unit. 

Unlike  anticipatory  coarticulation  effects  that  are  essentially  timing 
effects,  carryover  coarticulation  effects  are  usually  considered  as  mechanical 
effects  and  exist  in  the  form  of  variability  in  target  (or  target  feature) 
positions  as  a function  of  changes  in  phonetic  context.  Although  carryover 
effects  have  been  shown  to  exist  at  both  the  EMG  and  articulatory  levels,  the 
pervasiveness  of  these  effects  is  somewhat  in  doubt.  In  a study  of  the 
production  of  thirty-six  CVC  monosyllables,  MacNeilage  and  DeClerk  (1969) 
found  that  some  aspect  of  the  production  of  every  phone  was  always  influenced 
by  a following  phone.  In  particular,  the  size  of  the  EMG  signal  would  be 
different  depending  on  the  identity  of  the  adjacent  vowel  or  consonant.  In 
countering  the  argument  that  a motor  command  representation  of  the  phone  shows 
less  variability  than  an  articulatory  target  representation,  MacNeilage  (1970) 
later  proposed  that  the  observed  EMG  variability  reflected  a complex  motor 
strategy,  the  underlying  goal  of  which  is  a relatively  invariant  articulatory 
end.  The  concept  of  an  articulatory  based  target  system  as  proposed  by 
MacNeilage was further supported, at least for vowels, by the cinefluorographic
data  of  Gay  et  al.  (Gay,  Ushijima,  Hirose  and  Cooper,  1974)  and  Gay  (1974a). 
In  the  latter  study,  lateral  view  x-ray  motion  pictures  were  obtained  from  two 
speakers  who  produced  the  vowels  /i,a,u/  in  a variety  of  VCV  contexts.  The 
results  of  this  experiment  showed  that  for  both  subjects,  the  target  positions 
for both /i/ and /u/, in both pre- and post-consonantal positions, remained
quite  stable  (within  2-3  mm)  across  changes  in  the  consonant  and  trans- 
consonantal  vowel.  Although  target  stability  for  /a/  was  also  the  rule  rather 
than  the  exception,  some  individual  differences  did  appear.  However,  the 
articulatory  variability,  when  it  did  appear,  did  not  correlate  with  any 
acoustic  variability. 

Similar  results  were  also  reported  in  a more  recent  cinefluorographic 
study  (Gay,  in  press).  Carryover  effects  of  an  intervocalic  consonant  on  the 
following  vowel  again  appeared  only  for  the  open  vowel  /a/,  and  were  reflected 
only  in  differences  in  jaw,  and  consequently,  tongue  body  height.  However, 
carryover  effects  of  a preceding  consonant  on  the  attainment  of  target 
positions for the vowels /i/ and /u/ were minimal, with the tongue body
targets  of  both  vowels  falling  within  a range  of  2.5  mm  for  one  subject  and  3 
mm  for  the  other.  This  lack  of  variability  is  illustrated  in  Figure  6.  The 
figure  shows  the  relative  positions  of  the  upper  lip,  lower  lip,  jaw  and 
tongue  body  at  the  time  the  tongue  body  reached  its  target  (point  of  maximum 
displacement) for each of nine utterances containing the vowel /i/ in final
position. As is evident, variability of tongue body target positions is
minimal. Lower lip and jaw positions, on the other hand, vary within a larger
range,  approximately  5 mm  for  subject  FSC  and  10  mm  for  subject  GNS. 
Interestingly,  lower  lip  and  jaw  targets  seem  to  vary  independently  from 
tongue  body  positions  but  covary  for  both  subjects.  This  finding  contradicts 
that of Hughes and Abbs (1976) who showed that mouth opening for /i/ remained
relatively constant because of trade-offs between lower lip and jaw displacements.
This type of equivalence was not evident in the present data for
either /i/ or /u/.
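
The two measures used in this paragraph, the range of target positions and the covariation of lower lip and jaw, can be sketched as follows. The nine sets of measurements are invented for illustration (they are not the data of Figure 6), and the function names are hypothetical. A trade-off of the kind Hughes and Abbs describe would appear as a negative correlation between lip and jaw positions, whereas covariation appears as a positive one.

```python
from math import sqrt


def spread(values):
    """Range of a set of target positions (mm)."""
    return max(values) - min(values)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical target heights (mm) for nine utterances ending in /i/.
tongue_body = [20.1, 20.8, 19.6, 21.0, 20.3, 19.9, 20.6, 21.4, 20.0]
jaw = [10.0, 12.5, 8.0, 14.0, 11.0, 9.0, 13.0, 15.0, 10.5]
lower_lip = [12.2, 14.3, 10.1, 16.4, 13.0, 11.2, 15.1, 17.3, 12.4]

tongue_range = spread(tongue_body)       # small: stable tongue body target
jaw_range = spread(jaw)                  # larger: variable jaw position
lip_jaw_r = pearson_r(lower_lip, jaw)    # positive: lip and jaw covary
```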

Carryover  effects,  then,  when  they  do  appear,  are  unlike  anticipatory 
effects  in  that  they  depend  on  the  phonetic  identity  of  the  particular 
segment.  Like  anticipatory  effects,  however,  carryover  effects  seem  to  spread 
no  farther  than  the  neighboring  phone.  Stability  of  tongue  body  targets  for 
vowels (at least /i/ and /u/) is the rule rather than the exception. The only
substantial  articulatory  variability  occurred  in  jaw  displacement,  with  /a/ 
showing  the  greatest  effects  and  /u/  the  least.  However,  variability  in  jaw 
displacement  for  /a/,  as  measured  anteriorly  at  the  incisors,  might  be  either 
exaggerated  or  irrelevant  in  relation  to  variability  that  might  exist  in  the 
pharyngeal  constriction  for  /a/.  Likewise,  the  variability  of  maximum  jaw 
displacement for both /i/ and /u/ is unrelated to the variability observed in
the  position  of  the  tongue  body  for  those  vowels.  Thus,  the  two  features, 
tongue  body  height  and  jaw  displacement,  are  probably  independent  ones,  with 
jaw opening being a facilitory gesture and an unmarked phonetic feature.

SUPRASEGMENTAL  ORGANIZATION:  THE  CONTROL  OF  SPEECH  RATE 

In  the  preceding  section,  variability  in  the  production  of  a phone  due  to 
changes  in  phonetic  environment  was  discussed.  In  this  section,  questions 
concerning  allophonic  variation  that  arises  from  a different  source,  the 
suprasegmental  feature  of  speaking  rate,  will  be  explored. 

Experiments  on  the  effects  of  speaking  rate  and  stress  have  been 
concerned  primarily  with  the  question  of  whether  all  such  effects  can  be 
attributed  solely  to  changes  in  the  timing  of  commands  to  the  articulators. 
The  classic  experiments  on  the  effects  of  stress  and  speaking  rate  on  vowels 
were  conducted  by  Lindblom  (1963,  1964).  By  inferring  changes  in  articulator 
positions  from  sound  spectrograms,  Lindblom  found  a positive  correlation 
between vowel reduction, or "undershoot," and either decreased stress or
increased  speaking  rate.  The  failure  of  the  articulators  to  reach  the  vowel 
"target"  was  attributed  to  the  close  temporal  succession  of  motor  commands, 
and  so  to  insufficient  time  to  complete  the  component  gestures.  In  addition, 
Lindblom's speaking rate data showed that the rate of target-directed articulator
movement remained constant across changes in duration. This supported
the  concept  of  undershoot  as  being  a time-based  phenomenon;  also,  it  implied  a 
simple  model  to  account  for  the  effects  of  stress  and  speaking  rate,  that  is, 
a cut-off  of  the  commands  rather  than  a complete  reorganization  of  the 
gesture.
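
The time-based account described above can be illustrated with a deliberately simplified sketch: an articulator moves toward its target at a fixed rate, and the movement is cut off when the available time runs out. The linear, constant-velocity form, the function names, and the numbers are assumptions made for illustration; this is not Lindblom's own formulation.

```python
def reached_displacement(distance_mm, rate_mm_per_ms, duration_ms):
    """Displacement achieved when movement proceeds at a constant rate
    and is cut off after duration_ms (never more than the full distance)."""
    return min(distance_mm, rate_mm_per_ms * duration_ms)


def undershoot(distance_mm, rate_mm_per_ms, duration_ms):
    """Shortfall between the target distance and the displacement reached."""
    return distance_mm - reached_displacement(
        distance_mm, rate_mm_per_ms, duration_ms)


# Hypothetical values: a 12 mm excursion at 0.1 mm/msec.
slow = undershoot(12.0, 0.1, duration_ms=150)  # enough time: target reached
fast = undershoot(12.0, 0.1, duration_ms=90)   # cut off early: about 3 mm short
```

Under this assumption, undershoot depends only on the time available, which is the sense in which the model calls for a cut-off of commands rather than a reorganization of the gesture.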

However, in two separate experiments, one cinefluorographic (Gay et al.,
1974)  and  one  electromyographic  (Gay  and  Ushijima,  1974),  it  was  found  that 
changes  in  speaking  rate  are  brought  about  by  a complex  reprogramming  of  the 
input to the speech string. For example, in a combined electromyographic-
cinefluorographic study of speaking rate control (Gay et al., 1974), it was
found that an increase in speaking rate was accompanied by a decrease in vowel
duration  and,  for  the  most  part,  articulatory  undershoot.  However,  the  degree 
of  articulatory  undershoot  varied  with  both  the  individual  subject  and 
phonetic  identity  of  the  vowel.  These  differences  are  illustrated  in  Figure 
7.  This  figure  shows  tongue  body  and  jaw  movement  for  the  sequence  /api/,  for 
two  different  speaking  rates.  It  is  evident  that  the  degree  of  undershoot  is 
considerably greater for the open vowel /a/ than for the close vowel /i/.

More  interesting  than  the  existence  of  articulatory  undershoot  for  fast 
speech  are  the  underlying  muscle  action  patterns  that  control  those  movements. 
The  EMG  data  provide  a fairly  complete  account  of  this  control  mechanism. 
These  data  show  that  lip  muscle  activity  (orbicularis  oris)  associated  with 
labial  consonant  production  and  tongue  tip  muscle  activity  (superior  longitu- 
dinal) associated  with  lingual  consonant  production  increase  for  fast  speech, 
while  genioglossus  muscle  activity  for  tongue  body  movement  during  vowel 
production  decreases  during  fast  speech.  These  results  are  illustrated  for 
the sequence /ipi/, in Figure 8.

The  first  finding  implies  an  increase  in  articulatory  effort  and  an 
increase  in  the  speed  of  articulatory  movement:  the  production  of  both  /p/ 
and /t/ requires a complete occlusion of the vocal tract, which must be
produced  more  quickly  and  with  greater  effort  during  fast  speech.  The 
reduction  in  EMG  activity  for  the  vowel  during  fast  speech,  on  the  other  hand, 
is  compatible  with  the  view  that  a vowel  target  has  a built-in  error  or 
tolerance  factor  that  can  absorb  the  extra  demands  in  speech  rate. 

Two  other  interesting  results  appeared  in  these  experiments.  One  was 
that  lip  muscle  activity  associated  with  rounding  for  /u/  also  increases  with 
an  increase  in  speaking  rate.  This  implies  that  the  different  effects  of 
changes  in  speaking  rate  are  related  either  to  specific  muscle  systems  or 
individual  phonetic  features  rather  than  basic  differences  in  phonetic  catego- 
ries. The  other  was  that,  for  both  subjects,  an  increase  in  speaking  rate  was 
accompanied  by  an  increase  in  the  frequency  levels  of  both  the  first  and 
second  formants.  Thus,  since  the  acoustic  triangle  is  not  reduced  toward  the 
neutral  schwa,  articulatory  undershoot  during  fast  speech  does  not  produce  the 
same  acoustic  result  as  articulatory  undershoot  during  destressed  speech. 

The  most  important  aspect  of  the  electromyographic  speaking  rate  data  is 
not the direction in which the amplitude of the signals changes for consonants
and vowels, but rather the fact that they change. Changes in both the timing
and  amplitude  of  the  EMG  signals  with  changes  in  speaking  rate  signified  that 
the  control  of  speech  rate  requires  complex  motor  programming,  and  not  simply 
a reordering  of  the  timing  function  (Lindblom,  1963). 


SUMMARY 


The  major  points  made  in  this  paper  are  as  follows:  first,  anticipatory 
movements  toward  the  second  vowel  in  a VCV  sequence  begin  during  the  closure 
period  of  the  intervocalic  consonant.  This  restricted  coarticulatory  field 
includes  both  the  tongue  body  movement  and  lip  rounding  gesture  associated 
with  the  second  vowel.  Furthermore,  the  size  of  this  field  is  not  affected  by 
the identity of the intervocalic consonant. Second, like anticipatory effects,
carryover effects do not seem to extend beyond an immediately neighboring
segment. Unlike anticipatory effects, however, the appearance of carryover
coarticulation effects depends on the phonetic identity of the particular
segment on which these effects might act.

Figure 7: Articulatory movements for two speaking rates. The filled circles
correspond to the slow rate and the unfilled circles, the fast rate.

The  implication  of  these  findings  is  that  the  rules  governing  the 
segmental  input  to  a VCV  string  are  probably  not  as  complex  as  present  models 
suggest.  The  fact  that  anticipatory  movements  begin  and  primary  carryover 
effects  end  at  about  the  same  time  during  the  closure  period  of  the  consonant, 
suggests  that  the  release  of  the  consonant  and  movement  toward  the  vowel  are 
organized  and  produced  as  an  integral  articulatory  event.  This  formulation 
argues  against  the  operation  of  a scan-ahead  mechanism  at  the  segmental  level. 
Rather,  all  features  of  both  elements  of  the  syllable  are  contained  within  the 
boundaries  of  that  unit. 

This  does  not  necessarily  mean,  however,  that  a scan-ahead  mechanism  does 
not  operate  at  another  stage  of  speech  production.  The  complex  reorganization 
of  commands  accompanying  changes  in  speaking  rate  suggests  that  the  temporal 
features  of  a downstream  segment  are  known  in  advance. 

Thus,  while  it  has  been  traditionally  considered  that  the  serial  ordering 
of  segments  is  governed  by  complex  rules  whose  effects  can  spread  across 
several  adjacent  segments,  and  the  temporal  control  of  speech  is  governed  by  a 
simple  adjustment  of  timing  of  commands  to  the  articulators,  it  may  well  be 
that the reverse is true: the segmental input to the speech string is
governed by simple rules, while the temporal formulation of the string
requires  complex  articulatory  adjustments  based  on  advance  information 
obtained  from  a higher  level  scan-ahead  mechanism. 

Stimulus Dominance and Ear Dominance in the Perception of Dichotic Voicing
Contrasts

Bruno  H.  Repp 


ABSTRACT 


Two  studies  were  conducted  to  determine  the  effect  of  varia- 
tions in  voice  onset  time  (VOT)  on  the  perception  of  dichotic  stop- 
consonant-vowel  syllables  contrasting  in  the  voicing  feature.  The 
dichotic  stimuli  were  partially  fused,  so  that  only  a single 
response  was  required.  Variations  in  VOT  had  a systematic  effect  on 
the  probability  of  hearing  the  fused  stimuli  as  voiced  or  voiceless. 
Changing  the  VOT  of  a voiceless  stimulus  had  a larger  effect  than 
changing  the  VOT  of  a voiced  stimulus.  Unless  one  of  the  competing 
stimuli  was  close  to  the  category  boundary,  the  perceptual  integra- 
tion of  their  VOTs  seemed  to  be  roughly  additive.  The  relative 
phase  of  the  periodic  portions  of  the  stimuli  had  an  unexpected 
effect  on  perception  that  remains  to  be  explained.  A number  of 
subjects  showed  very  strong  right-ear  dominance  in  these  tests.  The 
range  and  reliability  of  the  laterality  effects  obtained,  as  well  as 
certain  other  methodological  features,  make  the  present  tests 
promising  as  tools  for  assessing  individual  differences  in  ear 
dominance.


INTRODUCTION 


When  two  different  auditory  stimuli  are  presented  simultaneously  to  the 
two  ears,  the  perceptual  result  depends  on  a number  of  factors.  One  of  these 
is dichotic (binaural, lower-level) fusion. It determines whether one or two
stimuli  are  heard.  If  the  two  inputs  are  very  similar  in  their  spectral  and 
temporal  characteristics,  they  may  fuse,  so  that  only  a single  stimulus  is 
heard.  If  the  two  stimuli  are  dissimilar,  two  separate  events  are  heard  at 
the  two  ears,  but  it  may  nevertheless  be  difficult  to  identify  both  of  them 
correctly.  Two  other  factors  determine  which  of  the  two  competing  inputs  is 
perceptually  more  prominent.  "Perceptual  prominence"  means,  in  the  case  of 
fused  stimuli,  that  the  fused  percept  sounds  more  like  one  component  than  like 
the  other,  or,  in  the  case  of  unfused  stimuli,  that  one  of  the  inputs  is  more 
often  correctly  identified  (or  stands  out  more  clearly)  than  the  other.  One 
of  the  two  factors  affecting  perceptual  prominence  is  ear  dominance:  the 
stimulus in one ear may be more prominent than the stimulus in the other ear.
The other factor is stimulus dominance: in a particular stimulus combination,
one  stimulus  may  be  more  prominent  than  the  other,  regardless  of  the  ear  to 
which  it  is  presented.  Finally,  it  is  possible  that  the  perceptual  result  of 
dichotic  interaction  does  not  resemble  either  of  the  two  competing  stimuli  but 
is  a more  complex  composite  of  the  two--a  consequence  of  higher-level  fusion 
(see  Cutting,  1976).  Ear  dominance,  stimulus  dominance,  and  higher-level 
fusion  are  probabilistic  factors  whose  effects  can  be  estimated  only  from  a 
large  number  of  trials;  they  also  vary  substantially  from  listener  to 
listener.  Lower-level  fusion,  on  the  other  hand,  is  basically  deterministic; 
it  may  vary  in  degree  between  different  stimulus  combinations,  but  it  usually 
does  not  vary  over  time  or  between  individuals. 

The  theoretical  and  methodological  significance  of  these  factors  for 
studying  dichotic  competition  between  speech  sounds  has  been  discussed  in 
several recent papers (Cutting, 1976; Repp, 1976b, 1977a, 1977b). Dichotic
listening tasks are used not only to gain insight into the mechanisms of
speech  perception  and  selective  attention  but  also — and  more  frequently — to 
assess  lateral  asymmetries  in  perception  that  reflect,  at  least  in  part, 
functional  lateralization  in  the  brain.  In  the  past,  the  large  majority  of 
studies  has  used  unfused  stimuli  for  these  purposes,  which  were  heard  as  two 
separate  events,  so  that  two  responses  were  required  on  each  trial.  Fused  (or 
partially  fused)  stimuli,  on  the  other  hand,  require  only  a single  response, 
which  offers  certain  methodological  advantages,  especially  with  respect  to 
measuring  laterality  effects.  There  is  a close  formal  analogy  between  ear 
dominance  and  stimulus  dominance  in  the  dichotic  single-response  paradigm  on 
the  one  hand,  and  sensitivity  and  bias  in  a signal-detection  situation  on  the 
other  hand.  As  sensitivity  can  be  measured  by  varying  bias  and  deriving  a 
receiver-operating-characteristic  (ROC)  function  and  an  associated  index  of 
sensitivity  (Green  and  Swets,  1966),  so  ear  dominance  can  be  measured  by 
varying  stimulus  dominance  and  deriving  an  appropriate  ROC  (isolaterality) 
function  and  an  associated  index  (Repp,  1977b).  This  analogy  makes  the 
dichotic  single-response  paradigm  singularly  attractive  as  a tool  for  measur- 
ing laterality  effects.  The  traditional  two-response  paradigm  is  much  less 
suited  for  the  application  of  signal  detection  procedures. 
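
The signal-detection side of this analogy can be illustrated with the standard equal-variance Gaussian indices of sensitivity and bias. The sketch below is offered only to show how the two quantities are separated; it is not the isolaterality index of Repp (1977b), whose form is not given here, and the function names are hypothetical.

```python
from statistics import NormalDist

_z = NormalDist().inv_cdf  # inverse of the standard normal distribution


def d_prime(hit_rate, false_alarm_rate):
    """Sensitivity index d' = z(H) - z(F); both rates must lie strictly
    between 0 and 1."""
    return _z(hit_rate) - _z(false_alarm_rate)


def criterion(hit_rate, false_alarm_rate):
    """Response bias c = -(z(H) + z(F)) / 2; 0 means no bias."""
    return -0.5 * (_z(hit_rate) + _z(false_alarm_rate))


# Two hypothetical observers with different bias but similar sensitivity.
neutral = (d_prime(0.84134, 0.15866), criterion(0.84134, 0.15866))
liberal = (d_prime(0.97725, 0.50), criterion(0.97725, 0.50))
```

Varying bias while sensitivity stays fixed traces out points on a single ROC function; in the dichotic analogy, stimulus dominance plays the part of bias and ear dominance the part of sensitivity.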

Several  assumptions  and  prerequisites  must  be  satisfied  before  the 
methods  of  signal  detection  theory  can  be  effectively  applied  to  the  results 
of  a dichotic  test  using  fused  stimuli.  First  of  all,  ear  dominance  and 
stimulus  dominance  must  be  independent  factors — an  assumption  that  seems 
reasonable  enough  to  be  accepted  here  without  further  discussion.  Second,  any 
dichotic  test  contains  a number  of  different  stimulus  combinations,  and  it  is 
necessary  that  ear  dominance  does  not  vary  as  a function  of  these  different 
combinations,  except  for  random  measurement  error.  In  order  to  satisfy  this 
(testable)  assumption,  the  set  of  stimuli  used  in  a test  should  be  as 
homogeneous  as  possible  in  terms  of  their  phonetic  and  auditory  properties. 
Third,  each  stimulus  combination  should  receive  only  two  different  categories 
of  responses,  reflecting  perceptual  dominance  of  one  or  the  other  component 
stimulus.  Stimulus  combinations  that  are  subject  to  higher-level  fusion  and 
generate  more  than  two  categories  of  responses  are  not  desirable.  Fourth, 
stimulus  dominance  in  different  stimulus  combinations  should  vary  systemati- 
cally over  a wide  range.  Individual  subjects  often  bring  very  different 
biases to a task, and a wide variation of intrinsic (that is, average or
expected) stimulus dominance relationships makes it less likely that some
individuals  show  such  strong  biases  in  most  stimulus  pairs  that  their 
sensitivity  (ear  dominance)  can  no  longer  be  measured  reliably.  In  addition, 
systematic  variation  of  stimulus  dominance  permits  the  actual  derivation  of  an 
ROC  function  whose  shape  determines  the  index  of  sensitivity  to  be  used. 
Fifth,  for  the  whole  effort  to  be  worthwhile,  the  stimuli  used  must  generate 
reliable  asymmetries  and  individual  differences  in  ear  dominance.  We  assume 
that  ear  dominance  varies  in  degree  between  individuals  and  can  be  measured  on 
(at least) an ordinal scale (cf. Shankweiler and Studdert-Kennedy, 1975).


The  set  of  stimuli  most  frequently  used  in  recent  dichotic  studies  is 
comprised of the consonant-vowel syllables, /ba/, /da/, /ga/, /pa/, /ta/, and
/ka/.  The  fifteen  possible  combinations  of  these  syllables  fall  into  three 
sets:  place  contrasts  (/ba/-/da/,  /ba/-/ga/,  /da/-/ga/,  /pa/-/ta/,  /pa/-/ka/, 
and  /ta/-/ka/),  voicing  contrasts  (/ba/-/pa/,  /da/-/ta/,  and  /ga/-/ka/),  and 
double-feature  contrasts  (/ba/-/ta/,  /ba/-/ka/,  /da/-/pa/,  /da/-/ka/,  /ga/- 
/pa/,  and  /ga/-/ta/).  Dichotic  (voiced)  place  contrasts  have  been  investigat- 
ed in  detail  by  Repp  (1976b).  When  precisely  aligned  and  minimally  distinc- 
tive synthetic  syllables  are  used,  these  stimuli  fuse  perfectly  and  are 
virtually  indistinguishable  from  binaural  syllables.  This  makes  them  ideal 
for  the  single-response  paradigm.  However,  they  yield  only  small  ear  advan- 
tages— perfect  fusion  seems  to  prevent  strong  lateral  asymmetries.  Some  of 
the  place  contrasts  yield  a third  response  category  ("psychoacoustic  fusion" — 
cf.  Cutting,  1976).  Although  stimulus  dominance  relationships  can  be  varied 
to  some  degree  by  changing  the  acoustic  structure  of  the  stimuli,  there  are 
limits  to  this  variation,  and  informal  observations  have  shown  that  some 
stimulus  combinations  have  extreme  intrinsic  biases  that  are  difficult  to 
remove.  Thus,  place  contrasts  (voiced  place  contrasts,  at  least;  voiceless 
place  contrasts  have  not  been  investigated  in  detail)  are  problematic  with 
regard  to  three  of  the  requirements  named  above. 
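
The three-way partition of the fifteen syllable pairs can be checked mechanically. The following is a minimal sketch, not part of the original study; the feature table and the names `FEATURES` and `contrast_type` are introduced here purely for illustration.

```python
from itertools import combinations

# Place and voicing features of the six syllables named in the text.
FEATURES = {
    "ba": ("labial", "voiced"),
    "da": ("alveolar", "voiced"),
    "ga": ("velar", "voiced"),
    "pa": ("labial", "voiceless"),
    "ta": ("alveolar", "voiceless"),
    "ka": ("velar", "voiceless"),
}


def contrast_type(a, b):
    """Classify a syllable pair by the features on which its members differ."""
    place_differs = FEATURES[a][0] != FEATURES[b][0]
    voicing_differs = FEATURES[a][1] != FEATURES[b][1]
    if place_differs and voicing_differs:
        return "double-feature"
    return "place" if place_differs else "voicing"


# Sort all 15 unordered pairs into the three contrast sets.
pairs_by_type = {"place": [], "voicing": [], "double-feature": []}
for a, b in combinations(FEATURES, 2):
    pairs_by_type[contrast_type(a, b)].append((a, b))

for kind, pairs in pairs_by_type.items():
    print(kind, len(pairs), pairs)
```

The counts (six place, three voicing, six double-feature contrasts) agree with the enumeration given above.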

Double-feature  contrasts  were  investigated  in  detail  by  Repp  (1977a). 
Precisely  aligned  synthetic  syllables  contrasting  only  in  the  relevant  acous- 
tic parameters  (voice  onset  time,  formant  transitions)  are  "partially  fused," 
that  is,  the  perfectly  fused  vowel  portion  is  preceded  by  an  unfused  portion 
of  very  brief  duration.  This  unfused  portion  results  from  aspiration  noise  in 
one  ear  (the  initial  portion  of  the  voiceless  stimulus)  being  accompanied  by  a 
different  noise  and/or  a periodic  waveform  in  the  other  ear  (the  initial 
portion  of  the  voiced  stimulus)  and  lasts  as  long  as  the  voice  onset  time 
(VOT)  of  the  voiceless  stimulus--perhaps  50  msec.  The  perceptual  result  is  a 
single  stimulus  accompanied  by  some  brief  noise  in  one  or  the  other  ear. 
Thus,  the  single-response  paradigm  is  appropriate  for  such  partially  fused 
stimuli.  Repp  (1977a)  showed  that  surprisingly  large  right-ear  advantages  are 
obtained  in  such  a test,  and  that  stimulus  dominance  can  be  varied  by  changing 
the acoustic structure of the stimuli (particularly their VOTs). The only
problem  with  double-feature  contrasts  is  that  they  lead  to  a large  number  of 
"blend"  responses,  that  is,  to  four  response  categories  for  each  stimulus 
combination  (cf.  Cutting,  1976).  Blend  responses  are  a kind  of  higher-level 
fusion  and  convey  no  direct  information  about  ear  dominance.  When  calculating 
ear  dominance  indices,  either  a large  amount  of  the  data  must  be  discarded,  or 
separate  ear  dominance  indices  must  be  calculated  for  the  voicing  and  place 
dimensions,  which  raises  methodological  problems,  since  these  indices  are 
often not equal (Repp, 1977a).

Thus, we are left with the voicing contrasts. They are partially fused,
like  double-feature  contrasts,  because  of  the  difference  in  VOT  between  the 
two  stimuli.  For  each  stimulus  pair,  there  are  two  different  response 
categories  that  correspond  to  the  two  component  stimuli.  (For  example,  /ba/- 
/pa/  always  sounds  either  more  like  /ba/  or  more  like  /pa/.)  If  it  could  be 
shown  that  stimulus  dominance  can  be  varied  systematically  and  that  large  and 
reliable  ear  advantages  are  obtained,  voicing  contrasts  would  constitute  an 
optimal  dichotic  test  for  measuring  ear  dominance. 

Although  the  preceding  paragraphs  emphasized  methodological  considera- 
tions, the  present  studies  were  equally  motivated  on  theoretical  grounds. 
(See  points  3 and  4 below.)  These  were  the  main  objectives: 

(1)  To  demonstrate  large  right-ear  advantages  and  reliable  individual 
differences  in  ear  dominance  for  voicing  contrasts  in  a single- 
response paradigm.  This  was  an  attempt  to  replicate  the  extraordi- 
narily large  effects  obtained  by  Repp  (1977a)  with  double-feature 
contrasts.  If  it  is  the  voicing  distinction  (and  the  resulting 
partial  fusion  of  the  stimuli)  that  matters,  similarly  large  effects 
should  be  obtained  with  voicing  contrasts. 

(2)  To  demonstrate  that  intrinsic  stimulus  dominance  relationships  can 
be  systematically  changed  by  varying  the  VOTs  of  the  component 
stimuli.  This  also  amounted  to  an  extension  of  Repp  (1977a),  with 
more  attention  to  detail  and  to  the  range  over  which  changes  are 
possible.

(3) To investigate the rule by which competing dichotic VOTs are
perceptually integrated. Repp (1977a) obtained a curious interaction:
when the voiceless stimulus in one ear had a VOT of +40 msec, a change
in the VOT of the voiced stimulus in the other ear from 0 to +15 msec
reduced the percentage of voiced responses, as expected; however, when
the voiceless stimulus had a VOT of +55 msec, the same change in the
voiced stimulus had a slight effect in the opposite direction. The
present studies attempted to clarify this interaction by using more
steps on the VOT dimension and by factorially combining different VOTs
of voiced and voiceless stimuli. Assuming that the interaction would no
longer be obtained or could otherwise be accounted for, the question
may be asked: According to what rule are competing VOTs integrated
into a single percept? Can the process be described by an additive or
by a multiplicative model? The theory of functional measurement
provides an appropriate framework for this purpose (Anderson, 1974;
Massaro and Cohen, 1976).

(4)  To  investigate  the  shape  of  the  ROC  (isolaterality)  function  con- 
necting points  of  equal  ear  dominance  at  different  levels  of 
stimulus  dominance.  Since  variations  in  stimulus  dominance  were  to 
be  produced  by  varying  VOT  only,  and  since  the  place  feature  was 
held  constant  in  each  test,  the  stimuli  were  maximally  homogeneous, 
and  constancy  of  ear  dominance  with  changes  in  VOT  could  be  safely 
assumed.  Nevertheless,  this  assumption  could  be  tested  by  examining 
the  scatter  of  the  data  points  for  individual  stimulus  pairs.  If 
they do not lie on any single smooth function, lack of homogeneity
would be indicated. If they do, the shape of the function would be
of great theoretical and practical interest. Repp (1977b) proposed
an index of ear dominance (called e') based on the assumption that
the data points (when plotted as "hits" against "false alarms," as
described below) follow a linear function that passes through the
origin of the unit square, that is, a linear approximation to the
standard ROC function of signal detection theory. The e' index was
first applied by Repp (1977a), and the data of that experiment
supported the assumptions, although there was considerable scatter
(see Repp, 1977b, Figure 4). The present studies provided an
opportunity for further tests of the assumptions underlying the e'
index.


EXPERIMENT I


Method


Subjects. The subjects were eight Yale undergraduates, paid volunteers,
some  of  whom  had  participated  in  earlier  experiments  using  synthetic  speech 
and  dichotic  listening.  In  addition,  the  author  and  a colleague,  both  highly 
experienced  listeners,  participated.  Two  of  the  less  experienced  subjects 
were  left-handers.  Three  additional  subjects  were  excluded  because  of  poor 
performance  and  mistakes  that  precluded  data  analysis. 

Stimuli. The stimuli were generated on the Haskins Laboratories parallel
resonance synthesizer and were similar to those used in Repp (1977a). All
syllables were 300 msec long and had no initial bursts and a constant
fundamental frequency (90 Hz). Different VOTs were generated by setting the
amplitude of the first formant to zero and exciting the higher formants with
a random noise source for the time specified. There were eight different
VOTs: four appropriate for voiced consonants (0, +5, +10, and +15 msec) and
four appropriate for voiceless consonants (+40, +45, +50, and +55 msec).

There  were  two  parallel  series  of  stimuli,  one  containing  only  labials 
(/ba/-/pa/),  the  other  only  velars  (/ga/-/ka/).  The  stimuli  were  digitized 
(with  a random  sampling  error  of  0.125  msec)  and  recorded  on  tape  using  the 
Haskins  Laboratories  PCM  system.  Each  dichotic  series  was  preceded  by  80 
binaural  syllables — a randomized  sequence  of  the  8 stimuli  repeated  10  times. 
Each dichotic sequence contained 320 stimulus pairs: the 16 VOT combinations
(4 voiced stimuli paired with 4 voiceless stimuli) in the two possible channel
assignments, in 10 successive individually randomized blocks of 32. The
dichotic stimuli were onset-aligned, with a maximal error of 0.125 msec. The
interstimulus interval (ISI) was 3 sec.
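
The structure of the dichotic sequence (16 VOT combinations, two channel assignments, ten individually randomized blocks) can be illustrated with a short sketch. The function names, the dictionary layout of a trial, and the random seed are assumptions made for this example; the original tapes were of course not produced this way.

```python
import random

VOICED_VOTS = [0, 5, 10, 15]        # msec, voiced series
VOICELESS_VOTS = [40, 45, 50, 55]   # msec, voiceless series


def make_block(rng):
    """One block: 16 VOT combinations x 2 channel assignments, shuffled."""
    trials = []
    for voiced in VOICED_VOTS:
        for voiceless in VOICELESS_VOTS:
            # Each combination occurs once per channel assignment.
            trials.append({"left": voiced, "right": voiceless})
            trials.append({"left": voiceless, "right": voiced})
    rng.shuffle(trials)
    return trials


def make_sequence(n_blocks=10, seed=0):
    """Concatenate independently shuffled blocks into one test sequence."""
    rng = random.Random(seed)
    return [trial for _ in range(n_blocks) for trial in make_block(rng)]


sequence = make_sequence()
print(len(sequence))  # 10 blocks of 32 pairs
```

Shuffling within blocks, rather than over the whole sequence, guarantees that every stimulus pair occurs exactly once in each run of 32 trials.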


Procedure. The tapes were played back on an Ampex AG-500 tape recorder,
and the subjects listened over Telephonics TDH-39 earphones. The intensities
of the two channels were carefully equalized at approximately 75 dB SPL (peak
deflections on a voltmeter). The channels were reversed electronically
between the labial and velar stimulus series, whose order was counterbalanced
across subjects. The subjects were not informed about the nature of the
dichotic stimuli. Their task was to rate each stimulus heard on a six-point
scale. Ratings 1-3 signified /ba/ or /ga/ (1: a clear instance; 3: ambiguous
but more like /ba/ or /ga/); ratings 4-6 signified /pa/ or /ka/ (4: ambiguous
but more like /pa/ or /ka/; 6: a clear instance). It became obvious in the
analysis  that  the  ratings  did  not  provide  any  significant  information  beyond 
that  obtained  from  simply  collapsing  the  ratings  into  two  categories,  voiced 
(1-3)  and  voiceless  (4-6).  Therefore,  the  results  are  reported  here  in  terms 
of  percentages  of  voiced  responses. 
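
The collapsing of the six-point ratings into two categories amounts to the following computation. This is a sketch only; the function name and the example ratings are hypothetical and are not data from the experiment.

```python
def percent_voiced(ratings):
    """Percentage of ratings in the voiced half (1-3) of the six-point scale."""
    if not ratings:
        raise ValueError("no ratings supplied")
    if any(r not in range(1, 7) for r in ratings):
        raise ValueError("ratings must be integers from 1 to 6")
    voiced = sum(1 for r in ratings if r <= 3)
    return 100.0 * voiced / len(ratings)


# Hypothetical ratings for one stimulus pair (not data from the experiment).
example_ratings = [1, 2, 3, 4, 6, 2, 5, 3, 1, 4]
print(percent_voiced(example_ratings))
```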

Results 


Stimulus Dominance. The stimuli were all reliably categorized in isolation
(binaural presentation). The percentages of voiced responses are shown
in  Table  1.  The  few  errors  that  occurred  reflected  the  different  locations  of 
the  category  boundaries  on  the  labial  and  velar  continua.  The  average 
boundary  on  a labial  continuum,  such  as  the  present  one,  is  typically  at  a VOT 
of  +25  msec,  while  that  on  a similar  velar  continuum  occurs  around  a VOT  of 
+30  msec  (Miller,  in  press).  Table  1 shows  that  the  /ba/  closest  to  the 
boundary  (VOT  = +15  msec)  received  some  /pa/  responses,  while  the  correspond- 
ing /ga/  was  consistently  identified.  On  the  other  hand,  the  /pa/s  were  more 
consistently identified than the /ka/s. Because of these differences, dichotic
pairs  of  velar  stimuli  were  expected  to  receive  more  voiced  responses  than 
pairs  of  labial  stimuli. 


TABLE 1: Percentages of voiced responses to the stimuli presented binaurally.

VOT (msec)    Labials    Velars

     0         100.0      100.0
    +5         100.0      100.0
   +10          99.0      100.0
   +15          93.0      100.0
   +40           5.0        5.0
   +45           1.0        5.0
   +50           0.0        4.0
   +55           0.0        6.0

This  expectation  was  borne  out:  overall,  labials  received  44.4  percent 
voiced  responses  and  velars  50.0  percent.  However,  this  effect  did  not  reach 
significance  because  two  subjects  showed  a difference  in  the  opposite  direc- 
tion and  one  subject  showed  none  at  all.  The  overall  percentages  of  voiced 
responses  show  that  the  dichotic  stimuli  were  perceptually  well  balanced;  they 
were  not  heard  as  predominantly  voiced  or  voiceless.  Individual  differences 
in the overall percentages of voiced responses ranged from 25.9 to 63.0
percent.

The  effects  of  the  variations  in  VOT  are  shown  in  Figure  1,  separately 
for  labials  and  velars.  The  four  functions  correspond  to  the  four  different 
VOTs  of  voiceless  stimuli;  the  VOTs  of  the  voiced  stimuli  are  on  the  abscissa. 
Thus,  the  vertical  separation  between  the  functions  represents  the  effect  of 
varying the VOT of the voiceless component, while deviations from
horizontality represent the effect of varying the VOT of the voiced
component. It was expected that the functions would be well separated (with
the longest VOT at the bottom), monotonically decreasing from left to right,
and parallel (if the two VOT effects are linearly independent and an additive
model applies).

The functions obtained were remarkably irregular. Clearly, the VOT of
the voiceless stimulus had a pronounced effect (F(3,27) = 48.6, p << .01). The
effect of the VOT of the voiced stimulus seemed erratic but was nevertheless
significant (F(3,27) = 6.5, p < .01). The four functions were far from
parallel, which was reflected in a highly significant interaction of the two
VOT effects (F(9,81) = 6.7, p << .01). In addition, each VOT effect interacted
with the place (labial vs. velar) factor (F(3,27) = 3.0, p < .05, and F(3,27) =
7.5, p < .01, respectively).

At  first,  these  results  seemed  extremely  puzzling.  The  significant 
effects  obtained  showed  that  the  irregularities  were  not  just  random  varia- 
tion, and  it  was  also  obvious  that  alternate  pairs  of  functions  in  Figure  1 
showed a certain parallelism, although it was not clear why. There must have
been  some  uncontrolled  factor  in  the  experiment  that  influenced  stimulus 
dominance.  Although  Repp  (1977a)  also  had  obtained  an  interaction  between  the 
two  VOT  effects,  the  precise  pattern  of  this  interaction  was  not  replicated, 
which  added  to  the  confusion. 

Eventually,  the  solution  was  found  in  the  relative  phase  of  the  periodic 
portions  of  the  dichotic  stimuli.  The  present  stimuli  had  a fundamental 
frequency  of  90  Hz,  so  that  one  period  of  the  waveform  lasted  11.1  msec.  The 
periodicity  began  at  the  VOT  specified  and  continued  with  a new  pulse 
occurring  every  11.1  msec.  Since  the  VOTs  were  specified  in  5-msec  steps,  the 
periodic waveforms of some stimulus combinations were nearly in phase, while
others  were  completely  out  of  phase.  For  example,  the  fourth  and  fifth  pitch 
pulses  of  a stimulus  with  a VOT  of  0 occurred  at  44.4  and  55.5  msec, 
respectively.  Thus,  this  stimulus  was  nearly  in  phase  with  stimuli  whose  VOTs 
were  +45  or  +55  msec,  but  it  was  out  of  phase  with  stimuli  whose  VOTs  were  +40 
or  +50  msec.  Table  2 shows  the  phase  relationships  for  the  different  stimulus 
combinations.


TABLE 2: Phase relationships between the periodic stimulus portions in all
dichotic combinations. (Temporal asynchronies between pitch pulses
in msec.)

VOT of voiced          VOT of voiceless stimulus
stimulus            +40      +45      +50      +55

     0             -4.4      0.6     -5.5     -0.5
    +5              1.7     -4.4      0.6     -5.5
   +10             -3.3      1.7     -4.4      0.6
   +15              2.8     -3.3      1.7     -4.4

Note: Negative values indicate that the first pulse of the voiceless
stimulus preceded the temporally closest pulse of the voiced stimulus.
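
The entries of Table 2 follow from simple arithmetic on the pitch period. The sketch below is an illustration rather than the original computation; it assumes the rounded period of 11.1 msec used in the text, and the function name `pulse_asynchrony` is introduced here for convenience.

```python
PERIOD_MS = 11.1  # one pitch period at 90 Hz, rounded as in the text


def pulse_asynchrony(voiced_vot, voiceless_vot, period=PERIOD_MS):
    """Signed offset (msec) of the first voiceless pulse from the nearest voiced pulse.

    Voiced pulses occur at voiced_vot + k * period. A negative result means
    the first pulse of the voiceless stimulus precedes the closest voiced pulse.
    """
    lag = voiceless_vot - voiced_vot
    nearest = round(lag / period)            # index k of the closest voiced pulse
    return round(lag - nearest * period, 1)  # rounded to 0.1 msec, as in Table 2


for voiced in (0, 5, 10, 15):
    row = [pulse_asynchrony(voiced, voiceless) for voiceless in (40, 45, 50, 55)]
    print(voiced, row)
```

Pairs whose VOTs differ by 40 or 50 msec yield asynchronies of 4.4 or 5.5 msec, close to half a period, whereas the remaining pairs are within 3.3 msec of coincidence.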


It  is  clear  from  Table  2 that  stimuli  whose  VOTs  differed  by  a multiple 
of  10  were  out  of  phase  (the  maximal  pulse  asynchrony  being  5.5  msec),  while 
the  remaining  stimulus  combinations  were  more  or  less  in  phase.  The  data  of 
Figure  1 are  replotted  in  Figure  2,  separately  for  in-phase  and  out-of-phase 
stimulus pairs. To save space, labials and velars have been combined in this
figure.

The  transformation  of  the  irregular  pattern  of  Figure  1 into  the  orderly 
pattern  of  Figure  2 is  quite  remarkable.  In  particular,  it  turned  out  that, 
once  the  data  were  partitioned  according  to  phase,  the  four  functions  were 
nearly  parallel  within  each  set  of  data.  In-phase  and  out-of-phase  pairs  were 
analyzed  separately  in  4-way  analyses  of  variance.  Three  of  the  factors  were 
place  (labial  vs.  velar),  VOT  of  voiced  stimulus  (0  and  +5  vs.  +10  and  +15), 
and VOT of voiceless stimulus (+40 and +45 vs. +50 and +55). The fourth
factor ("VOT shift") represented the effect of simultaneous 5-msec changes in
the VOTs of both stimuli, that is, the difference between the solid and the
dashed functions in Figure 2.

For in-phase pairs, there was a significant decrease in the percentage of
voiced responses as the VOT of the voiced stimulus increased (F(1,9) = 11.9,
p < .01), and an even larger decrease as the VOT of the voiceless stimulus
increased (F(1,9) = 36.3, p << .01). The interaction between the two factors
was not significant, which confirms the parallelism of the functions in
Figure 2a. The main effect of VOT shift was not significant either. Note
that this factor represents here simultaneous 5-msec changes in the VOTs of
the two stimuli in opposite directions. Thus, the two VOT effects cancelled
each other in this case, despite the fact that, otherwise, VOT changes in the
voiceless stimulus had a larger effect than VOT changes in the voiced
stimulus. The only other significant effect was a triple interaction between
place, VOT shift, and VOT of the voiced stimulus (F(1,9) = 8.8, p < .05). It
was due to the fact that, in velars, a change in VOT from 0 to +10 resulted in
a decrease in voiced responses, but a change from +5 to +15 did not; in
labials, the pattern was reversed, if anything (cf. Figure 1).

For out-of-phase pairs, there was a significant effect of the VOT of the
voiced stimulus (F(1,9) = 17.9, p < .01), but, surprisingly, it went in the
opposite direction: the percentage of voiced responses increased with the VOT
of the voiced stimulus! The effect of the VOT of the voiceless stimulus was
in the expected direction and highly significant (F(1,9) = 63.1, p << .01), and
so was the effect of VOT shift (F(1,9) = 35.9, p << .01). Here, VOT shift
represented simultaneous 5-msec changes in the VOTs of the two stimuli in the
same direction. The inverted effect of the VOT of the voiced stimulus was
apparently not strong enough to cancel the effect of the VOT of the voiceless
stimulus. Again, the interaction between the two VOT effects was far from
significant, confirming the parallelism of the functions in Figure 2b. There
was a highly significant interaction between place and VOT of the voiced
stimulus (F(1,9) = 22.2, p < .01), and a marginally significant interaction
between place and VOT shift (F(1,9) = 5.3, p < .05). The first interaction
resulted from the fact that the inverted effect of the VOT of the voiced
stimulus was entirely due to the velar stimuli (cf. Figure 1); in labial
stimuli, the factor had no systematic effect at all. The other interaction
was negligible.


Thus,  the  primary  effect  of  phase  was  on  the  effect  of  the  VOT  of  the 
voiced  stimulus.  In  addition,  in-phase  pairs  generally  received  more  voiced 
responses  than  out-of-phase  pairs  (cf.  Figure  2). 

Ear Dominance. The expected large right-ear advantages were obtained.
The  ear  dominance  coefficients  for  the  individual  subjects  are  shown  in  Table 
3,  separately  for  the  two  tests. 


TABLE 3: Ear dominance coefficients for individual subjects in the two tests
(e' coefficients; see Repp, 1977b).

Subject      Labials        Velars

1 (a)         +0.89          +0.87
2             +0.44          +0.08 (e)
3             +0.54          +0.96
4 (b)         -0.59          +0.01 (e)
5 (a)         +0.70          +0.71
6             +0.74          +0.75
7             +0.15 (e)      +0.62
8             +0.54          +0.59
BHR (c)       +0.95          +0.64
AQS (d)       +0.65          +0.86

(a) Left-handers.
(b) This subject's coefficients may have incorrect signs (see text).
(c) The author.
(d) Colleague.
(e) Not significant. All other coefficients are significant at p < .05
or better (see Repp, 1977b, for procedure).


It  can  be  seen  that  all  subjects  but  one  showed  large  right-ear  dominance  in 
at  least  one  test.  This  includes  the  two  left-handers.  (One  of  them  had  also 
participated  in  Repp,  1977a,  and  shown  a large  right-ear  advantage  there;  the 
same  is  true  for  the  author).  The  only  large  left-ear  advantage  (subject  4) 
may  represent  a mistake  in  recording  the  channel-to-ear  assignments  for  this 
subject.  (He  was  later  retested  in  a dichotic  test  using  similar  stimuli  and 
showed a moderate right-ear advantage.) The average ear dominance coefficients
were  similar  for  the  two  tests  (+0.50  for  labials — +0.62,  if  the  sign  of  the 
only  left-ear  advantage  is  reversed — and  +0.61  for  velars)  and  comparable  to 
that  obtained  with  double-feature  contrasts  (+0.55 — Repp,  1977a). 

It  is  noteworthy,  however,  that  six  of  the  ten  subjects  showed  substan- 
tial differences  in  ear  dominance  between  the  two  tests.  These  differences 
were  not  related  to  order  of  presentation.  The  correlation  between  the  two 
tests  was  +0.70  (but  only  +0.29  if  the  sign  of  the  only  left-ear  advantage  was 
reversed!). This correlation does not reflect low test reliability. Stepped-up
test-retest reliabilities (obtained by correlating ear dominance coefficients
for the first and second halves of each test and subsequently applying the
Spearman-Brown formula; see Lord and Novick, 1968, p. 112) for the
labial and velar tests were +0.99 and +0.93, respectively (or +0.95 and +0.92,
respectively,  if  the  signs  of  the  ear  dominance  coefficients  for  subject  4 
were  reversed).  Thus,  the  correlation  between  the  present  tests  was  distinct- 
ly lower  than  their  reliabilities.  All  correlations  were  probably  somewhat 
overestimated  due  to  the  small  number  of  subjects  and  the  large  between- 
subject  variance.  A more  thorough  evaluation  of  the  reliability  of  the 
present  tests  will  require  a larger  subject  sample. 
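
For readers unfamiliar with the stepping-up procedure, the standard Spearman-Brown formula predicts the reliability of a full-length test from the correlation between its two halves. The sketch below uses a hypothetical split-half correlation; the half-test correlations of the present experiment are not reported in the text.

```python
def spearman_brown(r_half, length_factor=2.0):
    """Predicted reliability of a test lengthened by `length_factor`.

    With the default factor of 2, this steps a split-half correlation
    up to the reliability of the full-length test.
    """
    return length_factor * r_half / (1.0 + (length_factor - 1.0) * r_half)


# Hypothetical split-half correlation, chosen only for illustration.
print(round(spearman_brown(0.80), 3))
```

Because the correction is monotonic, a stepped-up reliability above +0.9 implies that the two halves of the test already correlated above roughly +0.8.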

The  present  experiment  offered  an  opportunity  to  test  the  assumptions 
underlying  the  e'  index  of  ear  dominance,  as  well  as  the  assumption  of  test 
homogeneity.  The  homogeneity  assumption  says  that  the  data  points  for 
individual  stimulus  combinations — when  plotted  as  "hits"  against  "false 
alarms",  as  described  below — should  lie  on  a single,  monotonic  function, 
except  for  random  variability.  The  e'  index  is  based  on  the  assumption  that 
this  function — the  isolaterality  contour  or  ROC  function — is  a linear  (or 
slightly  curvilinear)  function  that  passes  through  the  origin  of  the  unit 
square.  The  unit  square  is  the  plot  of  the  proportions  of  hits  against  the 
proportions  of  false  alarms,  familiar  from  signal  detection  theory. 

The  ROC  function  must  be  symmetric  around  the  negative  diagonal,  because 
of  the  complementarity  of  the  two  response  categories,  each  of  which  may  be 
divided  into  hits  and  false  alarms.  Therefore,  it  was  sufficient  to  consider 
only  the  less  frequent  response  to  each  stimulus  pair,  and  thus  only  the  area 
below  the  negative  diagonal  of  the  unit  square  (Repp,  1977b).  For  example,  if 
voiceless  responses  were  less  frequent  than  voiced  responses  for  a given 
stimulus  pair,  then  voiceless  responses  given  that  the  voiceless  stimulus  was 
in  the  right  ear  constituted  hits,  and  voiceless  responses  given  that  the 
voiceless  stimulus  was  in  the  left  ear  constituted  false  alarms.  These 
proportions  were  averaged  over  all  subjects  (excluding  subject  4)  for  each 
individual  stimulus  combination,  so  that  32  data  points  were  obtained.  They 
are  plotted  in  Figure  3. 
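
The assignment of responses to "hits" and "false alarms" described above can be sketched as follows. The function name and its arguments are hypothetical; equal numbers of trials in the two channel assignments are assumed, and the branch for the case in which voiced responses are the less frequent ones is inferred here from the symmetry argument rather than stated explicitly in the text. The computation of the e' index itself (Repp, 1977b) is not reproduced.

```python
def hits_and_false_alarms(p_voiceless_right, p_voiceless_left):
    """Return (hit rate, false-alarm rate) for the less frequent response.

    p_voiceless_right: proportion of voiceless responses when the voiceless
        stimulus was in the right ear.
    p_voiceless_left: the same proportion when the voiceless stimulus was
        in the left ear.
    """
    overall_voiceless = (p_voiceless_right + p_voiceless_left) / 2.0
    if overall_voiceless <= 0.5:
        # Voiceless responses are the less frequent ones (the case in the text).
        return p_voiceless_right, p_voiceless_left
    # Otherwise score voiced responses; the voiced stimulus is in the right
    # ear exactly when the voiceless stimulus is in the left ear.
    return 1.0 - p_voiceless_left, 1.0 - p_voiceless_right


# Hypothetical proportions for one stimulus pair.
hit, false_alarm = hits_and_false_alarms(0.6, 0.2)
print(hit, false_alarm)
```

By construction the resulting point never lies above the negative diagonal of the unit square, and a hit rate exceeding the false-alarm rate corresponds to a right-ear advantage.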

The  results  were  disappointing.  The  32  data  points  clustered  in  a 
roughly  circular  area  in  the  left-hand  quadrant  of  the  unit  square.  The 
stimuli were homogeneous in so far as all showed sizeable average right-ear
advantages.  However,  variability  was  so  large  that  it  was  impossible  to 
determine  a single  function  that  fitted  the  point  swarm  well.  The  variation 
was  not  systematically  related  to  either  the  place  distinction  or  the  phase 
factor.  All  that  can  be  concluded  is  that  there  was  large,  probably  random 
variation  between  stimulus  pairs.  For  a critical  test  of  the  assumptions 
underlying  the  e'  index,  either  more  observations  per  stimulus  pair  or  more 
extreme  stimulus  dominance  relationships  are  needed. 

Discussion 

The  present  study  achieved  several  of  its  goals.  It  showed  that  stimulus 
dominance  relationships  in  voicing  contrasts  can  be  systematically  changed  by 
varying  the  VOTs  of  the  component  stimuli,  particularly  the  VOT  of  the 
voiceless  stimulus  in  a pair.  After  taking  the  phase  factor  into  account,  it 
became  clear  that  the  perceptual  integration  of  the  VOTs  of  the  two  stimuli 
was  approximately  linear  and  additive.  Most  subjects  showed  extremely  large 
right-ear advantages, which replicated Repp (1977a). Only the question of
item homogeneity and the shape of the isolaterality contour remained
undecided, but at least the results did not directly contradict the
assumptions underlying the e' index of ear dominance.


A puzzle  was  created  by  the  unexpected  effect  of  the  relative  phase  of 
the  stimuli.  Why  did  phase  have  any  effect  at  all?  Why  was  the  effect  of  the 
VOT  of  the  voiced  stimulus  reversed  when  the  stimuli  were  out  of  phase?  And 
why  did  out-of-phase  stimuli  receive  fewer  voiced  responses?  Although  phase 
may  be  expected  to  have  some  effect  on  fusion,  there  was  no  indication  that 
in-phase and out-of-phase stimulus pairs were phenomenologically different.
To the author, the test sequences seemed perceptually quite homogeneous, and
no  stimulus  pairs  sounded  less  fused  than  others. 


Note that relative phase applied only to the simultaneous periodic
portions  of  the  stimuli,  that  is,  the  vocalic  portions  after  voicing  onset  in 
the  voiceless  stimulus.  Therefore,  it  was  especially  surprising  that  phase 
changed  the  effect  of  the  VOT  of  the  voiced  stimulus,  since  this  voicing  onset 
occurred  at  a time  where  phase  could  not  yet  have  played  a role.  One  way  of 
describing  the  results  would  be  that  the  voicing  feature  of  in-phase  stimuli 
was  determined  by  a weighted  average  of  the  VOTs  of  the  two  stimuli;  but,  in 
out-of-phase  stimuli,  it  was  determined  by  the  weighted  difference  of  the  two 
VOTs.  If  the  difference  between  the  two  competing  VOTs  played  a role,  an 
inverted  effect  of  the  VOT  of  the  voiced  stimulus  would  be  expected,  as  well 
as  a relative  decrease  in  voiced  responses.  Perhaps,  the  decision  mechanism 
responsible  for  the  voicing  feature  is  sensitive  to  the  intervals  between  any 
abrupt  changes  in  energy  at  and  shortly  after  stimulus  onset.  Normally,  there 
are  only  two  such  energy  increments:  stimulus  onset  and  voicing  onset.  In 
the  present,  partially  fused  stimuli,  however,  there  were  three:  stimulus 
onset,  voicing  onset  in  the  voiced  stimulus,  and  voicing  onset  in  the 
voiceless  stimulus.  Thus,  there  were  three  temporal  intervals,  and  the 
probability  of  hearing  a voiced  consonant  may  have  been  a weighted  function  of 
all  three.  That  intervals  other  than  VOT  can  affect  voicing  judgments  was 
demonstrated  by  Repp  (1976a).  He  found  that  the  interval  between  stimulus 
onset  in  one  ear  and  the  onset  of  an  isolated  vowel  in  the  other  ear  had  a 
significant  influence  on  the  probability  of  voiced  responses,  in  addition  to 
VOT.  However,  there  was  no  indication  in  these  data  that  the  interval  between 
voicing  onset  in  one  ear  and  vowel  onset  in  the  other  ear  played  a role, 
although  the  relative  phase  of  the  dichotic  stimuli  varied,  as  in  the  present 
experiment.  Thus,  the  phase  effects  obtained  here  remain  unexplained. 


Intriguing  as  the  phase  effect  was,  it  was  basically  an  ancillary  finding 
and  not  essential  to  the  theoretical  and  methodological  purpose  of  the  present 
experiment.  Therefore,  it  was  decided  not  to  investigate  the  effect  further, 
for  the  time  being,  but  instead  to  attempt  to  replicate  Experiment  I with 
stimuli  that  were  definitely  in  phase.  (Even  the  in-phase  stimuli  of 
Experiment  I were  slightly  out  of  phase.)  This  was  achieved  in  Experiment  II 
by  choosing  the  VOTs  of  the  voiceless  stimuli  so  that  they  coincided  with 
pitch pulses in the voiced stimuli.


EXPERIMENT II


Method 


Subjects. The subjects were seven paid volunteers and the author.
Again,  the  subjects  had  had  varying  degrees  of  exposure  to  synthetic  speech. 
One  subject  was  left-handed.  The  data  of  one  additional  subject  were  rejected 
because  they  were  too  variable. 

Stimuli. The stimuli for this experiment were generated on the OVE IIIc
synthesizer  at  Haskins  Laboratories,  a serial  resonance  synthesizer  that 
permits  finer  control  of  certain  stimulus  parameters  and  tends  to  produce 
somewhat  more  "natural"  speech.  This  time,  a continuum  of  alveolar  stops  was 
selected, ranging perceptually from /da/ to /ta/. All stimuli were 300 msec
long  and  had  a constant  fundamental  frequency  of  125  Hz.  This  resulted  in  a 
period  of  8 msec,  and  the  VOTs  were  spaced  accordingly  in  8-msec  steps.  As  in 
Experiment  I,  there  were  eight  VOTs:  0,  +8,  +16,  +24,  +32,  +40,  +48,  and  +56 
msec. Because of the wider spacing, the fourth VOT (+24 msec) was no longer
entirely within the voiced category but fell in the region of the phoneme
boundary.

VOT  was  varied  by  substituting  noise  for  periodic  excitation  and  setting 
the  first  formant  to  its  maximal  bandwidth.  The  pulse  generator  was  turned  on 
at  stimulus  onset  but  kept  at  minimum  amplitude  during  the  aspirated  portion 
of  the  signal;  this  insured  that  the  first  genuine  voicing  pulse  had  the 
intended  amplitude.  The  pulse  generator  of  the  synthesizer  was  synchronized 
to  stimulus  onset,  so  that  the  first  genuine  pitch  pulse  occurred  exactly  at 
the  VOT  specified.  Informal  observations  suggested  that  an  additional  factor 
influencing  dichotic  stimulus  dominance  is  the  relative  amplitude  of  the 
aspiration  noise  (which  can  be  controlled  in  the  OVE  synthesizer);  if  it  is 
too  low,  dichotic  voicing  contrasts  sound  predominantly  voiced,  and  if  it  is 
too high, they sound predominantly voiceless. In the synthesis specifications,
the amplitude setting for the noise generator was selected to be 5 dB
higher  than  the  subsequent  amplitude  setting  for  the  pulse  generator. 
However,  since  the  two  amplitude  parameters  were  not  on  the  same  scale,  the 
effective  amplitude  of  the  aspirated  portion  was  still  well  below  that  of  the 
vocalic  portion. 

The  stimuli  were  digitized  at  a 10  kHz  sampling  rate  (time-locked  to 
stimulus  onset).  The  stimulus  tape  contained  first  a series  of  80  binaural 
syllables — the  eight  stimuli  replicated  ten  times.  A sequence  of  five  blocks 
of  56  dichotic  pairs  followed.  Each  block  contained  all  possible  dichotic 
combinations  of  the  eight  stimuli  in  both  channel  assignments.  In  contrast  to 
Experiment  I,  within-category  combinations  were  included  here  to  facilitate 
the  task  by  providing  unambiguous  "anchor"  stimuli.  The  dichotic  stimuli  were 
onset-aligned with extreme precision. The interstimulus interval (ISI) was 3 sec.
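
The trial counts quoted above follow from the design and can be reproduced as a rough sketch. The names below are illustrative assumptions, and the order of trials on the actual tape is not modeled.

```python
from itertools import permutations

vots_msec = [0, 8, 16, 24, 32, 40, 48, 56]

# Binaural series: each of the eight stimuli replicated ten times.
binaural_series = vots_msec * 10

# One dichotic block: every ordered (left ear, right ear) pair of two
# different stimuli, which covers both channel assignments.
dichotic_block = list(permutations(vots_msec, 2))

print(len(binaural_series))     # 80
print(len(dichotic_block))      # 56
print(5 * len(dichotic_block))  # 280 dichotic pairs in five blocks
```

Using ordered pairs is what yields 56 rather than 28 combinations: each pairing appears once per channel assignment.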

Procedure. The rating scale was no longer used; the subjects simply
responded  by  writing  down  D or  T.  After  listening  to  the  binaural  stimuli, 
each  subject  listened  to  the  dichotic  series  twice.  Channels  were  reversed 
electronically  before  the  repetition.  Stimulus  intensity  was  higher  than  in 
Experiment  I,  about  85  dB  SPL  (peak  deflections  on  a voltmeter).  The  more 
experienced  subjects  were  informed  about  the  dichotic  nature  of  the  stimuli. 
Inexperienced subjects were simply told to identify the syllables they heard.
All  subjects  were  told  to  ignore  any  noises  they  might  hear  accompanying  the 
stimuli.  In  the  present  stimuli,  the  acoustic  segregation  of  the  aspiration 
noise  of  the  stimulus  with  the  longer  VOT  was  somewhat  more  noticeable  than  in 
Experiment  I;  if  required,  it  would  not  have  been  difficult  to  tell  in  which 
ear  this  stimulus  occurred  on  a given  trial.  Nevertheless,  it  was  easy  to 
attend  to  the  fused  stimulus  in  the  middle  of  the  head,  and  the  subjects' 
comments  suggested  strongly  that  their  responses  were  not  contingent  on  where 
they  heard  the  aspiration  noise. 

Results 

Stimulus Dominance. The percentages of voiced responses to the eight
binaural stimuli and to the within-category dichotic pairs are shown in
Table 4.


TABLE 4: Percentages of voiced responses to binaural stimuli and dichotic
within-category combinations.

VOT (msec)       0       +8      +16      +24

    0          100.0
   +8          100.0    100.0
  +16          100.0    100.0    100.0
  +24           99.4     98.8     95.0     51.3

VOT (msec)      +32      +40      +48      +56

  +32            1.3
  +40            2.5      0.0
  +48            1.9      1.9      0.0
  +56            0.0      1.3      2.5      0.0

It  can  be  seen  that  all  stimuli  but  one  were  identified  with  high 
consistency.  The  stimulus  with  VOT  = +24  fell  just  about  at  the  average 
category  boundary.  Most  of  the  errors  in  voiceless  stimulus  pairs  stemmed 
from  a single  subject  whose  data  were  somewhat  noisy.  The  most  interesting 
result  in  Table  4 is  that  there  were  hardly  any  voiceless  responses  to 
combinations  of  the  VOT  = +24  stimulus  with  the  three  voiced  stimuli;  the 
boundary  stimulus  was  almost  completely  dominated  by  stimuli  from  within  the 
voiced  category. 

The  results  for  the  between-category  combinations  are  shown  in  Figure  4. 
The  data  were  very  orderly  and  confirmed  the  predictions.  The  effect  of  the 
VOT of the voiced stimulus was highly significant (F(3,21) = 47.9, p << .01),
and so was the effect of the VOT of the voiceless stimulus (F(3,21) = 38.5, p <<
.01). In addition, there was a significant interaction between the two
factors (F(9,63) = 11.4, p << .01).
The interaction was due to the VOT = +24 stimulus, which was strongly
dominated  not  only  by  voiced  but  also  by  voiceless  stimuli.  Likewise,  the 
voiceless  stimulus  closest  to  the  boundary  (VOT  = +32),  although  consistently 
identified  in  isolation,  was  strongly  dominated  by  the  voiced  stimuli  and  did 
not  completely  dominate  the  VOT  = +24  stimulus.  Thus,  both  stimuli  flanking 
the  boundary  were  weak  in  dichotic  competition.  The  data  were  reanalyzed 
omitting  these  two  stimuli.  When  only  combinations  of  "good  instances"  of 
each  category  were  considered,  the  effect  of  the  VOT  of  the  voiced  stimulus 
was  much  reduced,  but  nevertheless  in  the  predicted  direction  and  significant 
(F(2,14) = [value illegible], p < .01); the effect of the VOT of the voiceless stimulus was
more pronounced (F(2,14) = 15.5, p < .01), and there was no longer any
significant  interaction.  In  other  words,  the  functions  were  again  parallel 
(cf.  Figure  4). 

Ear Dominance. The individual e' coefficients for the eight subjects are
shown in Table 5.


TABLE 5: Individual e' coefficients in Experiment II.

e' (one row per subject)

+0.14
+0.21 c
+0.86
-0.34
+0.22 c
+0.68
 0.00 c
+0.88

a Left-handed.
b The author.
c Not significant. All other coefficients are significant at p < .05
or better.


The  average  right-ear  advantage  in  this  test  was  smaller  than  in 
Experiment I but still substantial (e' = +0.37, based on average scores).
Only  two  subjects  and  the  author  showed  very  large  right-ear  advantages.  Of 
the  remaining  subjects,  four  showed  small  right-ear  advantages  and  one  a 
moderate  left-ear  advantage.  The  reliability  of  this  test  was  estimated  by 
the  split-half  method  to  be  +0.96,  although  some  subjects  showed  considerable 
variation.  The  reliability  is  again  somewhat  overestimated,  due  to  the  small 
subject  sample,  but  it  is  nevertheless  encouraging. 

Figure 5 shows the average hit and false alarm proportions for the
sixteen stimulus pairs. It can be seen that, in this test, much more
variation in the average "bias" was obtained than in Experiment I, a
consequence of including stimuli close to the category boundary. The points are
clearly fitted best by a linear (or perhaps curvilinear) function through the
origin, and not, for example, by a linear function parallel to the positive
diagonal. This latter function would be the isolaterality contour
corresponding to the simple difference score, p(H) - p(FA), as an index of ear dominance
(cf. Repp, 1977b). The present data strongly argue against this simple
difference index and support the assumptions underlying the e' index.

Figure 5: Ear dominance for 16 individual stimulus pairs in Experiment II,
plotted as in Figure 3, with the best-fitting linear isolaterality
contour drawn in.

Discussion 


Experiment II successfully replicated the results obtained in the
in-phase condition of Experiment I. The competing VOTs of voicing contrasts
appear  to  be  perceptually  integrated  according  to  an  additive  rule,  as  long  as 
neither  VOT  is  too  close  to  the  category  boundary.  When  the  VOT  of  one  of  the 
competing  stimuli  approaches  the  category  boundary,  this  stimulus  loses 
competitive  strength  and  is  dominated  by  the  opponent  stimulus.  Changes  in 
the  VOT  of  the  voiceless  stimulus  have  a more  pronounced  effect  than  changes 
of  equal  magnitude  in  the  VOT  of  the  voiced  stimulus. 

The  average  right-ear  advantage  in  the  present  experiment  was  not  as 
large as in Experiment I and in Repp (1977a). However, the exceptionally
large  average  effects  obtained  earlier  were  probably  fortuitous  and  due  to  the 
small  subject  samples.  These  earlier  tests  probably  just  happened  to  include 
a number  of  subjects  from  the  upper  end  of  the  distribution.  Large  variation 
in  ear  dominance  is  desirable  for  methodological  purposes:  the  larger  the 
variation, the more reliable will the measurements be (unless within-subject
variability increases in proportion to between-subject variation).

GENERAL  DISCUSSION 

The  effects  of  variations  in  acoustic  stimulus  structure  on  dichotic 
stimulus  dominance  relationships  confirm  once  more  that  dichotic  interaction 
between  speech  stimuli  does  not  take  place  at  a purely  phonetic  level. 
Effects  of  variations  in  VOT  on  dichotic  stimulus  dominance  have  recently  also 
been  reported  by  Carney  and  Speaks  (1976)  and  Miller  (1976).  Whether  the 
interaction  takes  place  at  a purely  auditory  level  or  at  an  intermediate 
"multicategorical"  stage  (Repp,  1976b,  1977a)  cannot  be  decided  from  the 
present data. The additivity of VOT effects for true within-category stimuli
and  the  breakdown  of  additivity  in  the  region  of  the  category  boundary  seem  to 
be  compatible  with  either  possibility.  However,  the  phase  effects  observed  in 
Experiment  I indicate  that  at  least  part  of  the  interaction  takes  place  at  a 
strictly  auditory  level — the  same  level  that  determines  fusion. 


The finding that a change in the VOT of the voiced stimulus had a smaller
effect  than  a change  in  the  VOT  of  the  voiceless  stimulus  may  have  been  due  to 
the  fact  that  the  stimuli  increased  in  amplitude  over  the  first  30  msec  or  so, 
which  is  the  region  of  short  VOT  values.  If  this  explanation  is  correct,  the 
effect  would  constitute  additional  evidence  for  interaction  at  an  auditory 
level.  According  to  informal  observations,  the  intensity  of  the  aspiration 
noise is another auditory factor affecting stimulus dominance. Strong dominance
of voiceless stimuli over voiced ones has sometimes been reported in the
literature  (for  example,  Berlin,  Lowe-Bell,  Cullen,  Thompson,  and  Loovis, 
1973).  Although  these  studies  used  natural  speech  stimuli,  the  perceptual 
asymmetry can almost certainly be explained in terms of the auditory properties
of the stimuli (such as long VOTs and heavy aspiration of voiceless
stops).

The methodological conclusions from the present studies are very encouraging.
The present tests show a pronounced average right-ear advantage, large
variation  between  subjects  and  high  reliability.  The  data  of  Experiment  II, 
at  least,  support  the  use  of  the  e'  index  proposed  by  Repp  (1977b).  Thus,  the 
present  methodology  appears  very  promising  for  the  further  investigation  of 
dichotic  laterality  effects.  The  size  of  the  average  ear  advantage  (and  of 
several  individual  ear  advantages)  obtained  here  and  in  Repp  (1977a)  is 
without  precedent  in  dichotic  research  with  normal  subjects.  It  may  be  that 
dichotic  voicing  contrasts  provide  optimal  conditions  for  lateral  asymmetries 
to  emerge:  they  are  sufficiently  well  fused  for  the  single-response  paradigm 
to  be  used,  but  not  so  strongly  fused  as  to  suppress  laterality  effects  (cf. 
Repp,  1976b).  Obviously,  the  small  difference  between  the  two  competing 
stimuli  in  their  first  40-60  msec  is  sufficient  to  produce  strong  ear 
asymmetries. It is intriguing to speculate that there is a direct relationship
between dichotic fusion and the suppression of ipsilateral auditory
transmission — one  of  the  factors  responsible  for  the  ear  advantage,  according 
to  Kimura's  (1961)  original  theory.  The  auditory  discrepancy  at  the  onset  of 
dichotic  voicing  contrasts  (periodic  vs.  noise  excitation)  may  lead  to  very 
effective  ipsilateral  suppression.  It  is  interesting  in  this  connection  that 
relative  phase  (Experiment  I)  had  no  effect  on  the  ear  advantage,  although  it 
affected  stimulus  dominance.  The  crucial  factor  in  ear  dominance  may  be  the 
periodic-noise  contrast  at  stimulus  onset  which,  of  course,  has  no  particular 
phase  relationship. 

It  is  also  likely  that  dichotic  voicing  contrasts  are  not  sensitive  to 
selective-attention  effects,  which  constitutes  another  methodological  plus. 
Although  it  may  be  possible  to  tell  in  which  ear  the  aspiration  noise 
occurred,  the  listener  would  have  to  infer  from  this  observation  that  the 
voiceless  stimulus  was  in  the  same  ear,  since  there  is  no  clear  percept  of  a 
separate  stimulus.  It  seems  that  such  inferences  can  be  avoided  by  proper 
instructions,  unlike  the  situation  with  unfused  stimuli  where  two  separate 
events  are  heard  and  selective-attention  strategies  constitute  a permanent 
noise  factor  that  is  difficult  to  control. 

One  principal  question  remains:  What  do  the  present  tests  measure,  that 
is,  what  is  their  validity?  Because  of  the  unusual  magnitude  of  the  ear 
advantages,  it  is  necessary  to  ask  whether  the  present  tests  measure  the  same 
phenomenon  that  traditional  two-response  tests  measure.  The  relatively  low 
correlation  between  the  labial  and  velar  tests  in  Experiment  I is  also  reason 
for concern. This finding is reminiscent of the similarly imperfect correlation
between the e' coefficients for the voicing and place features in Repp
(1977a).  Research  is  now  in  progress  to  determine  whether  different  tests  and 
different  methodologies  assess  the  same  single  factor  of  laterality,  or 
whether  perhaps  multiple  factors  are  involved.  Clearly,  there  is  still  much 
to  be  learned  about  measuring  ear  advantages,  and  the  "perfect  test"  is  still 
far  in  the  future. 
Voicing-Conditioned Durational Differences in Vowels and Consonants in the

ABSTRACT

Seven  consonant-vowel-consonant  minimal  pairs,  differing  only 
in  the  voicing  characteristic  of  the  final  consonants,  were  elicited 
from  20  three-  and  four-year  old  native  speakers  of  American 
English.  Spectrographic  analyses  of  the  utterances  revealed  that 
(1)  children  produced  vowel  duration  differences  of  the  same  nature 
and  magnitude  as  those  found  in  adult  speakers'  utterances;  (2)  the 
duration  of  a preceding  vowel,  as  well  as  the  duration  of  voicing 
during  the  final  consonant  closure,  are  reliable  predictors  of  the 
voicing  characteristic  of  the  final  consonant;  (3)  other  measures, 
such  as  syllable  duration,  final  consonant  closure  duration,  and 
vowel  duration  plus  final  consonant  closure  duration,  are  not  as 
reliable  as  vowel  duration  and  closure  voicing  duration  as 
predictors  of  final  consonant  voicing;  (4)  the  three-  and  four-year 
olds  did  not  produce  significantly  different  vowel  or  closure 
voicing  duration. 

INTRODUCTION 

Studies  of  vowel  production  by  adults  have  shown  that  vowels  preceding 
voiced  consonants  in  English  are  of  greater  duration  than  those  preceding 
voiceless  consonants  (Rositzke,  1939;  Heffner,  1941;  Belasco,  1953;  Peterson 
and Lehiste, 1960). That is, the vowel of /bid/ is of greater duration than
that  of  /bit/,  and  that  of  /bAz/  greater  than  that  of  /bAs/.  These  durational 
differences  have  been  shown  to  be  cues  to  the  voicing  characteristic  of  final 
consonants  in  synthetic  speech  (Denes,  1955;  Raphael,  1972).  Although  it  is 
not  yet  clear  whether  these  differences  in  vowel  duration  have  a physiological 
basis,  or  are  entirely  learned  behavior,  they  appear  to  be  robust  phenomena  in 
English.  Investigators  report  ratios  ranging  from  1.25:1  to  2.3:1  between  the 
averaged  durations  of  vowels  in  opposing  voicing  contexts  (Raphael,  1971). 

The  purpose  of  the  present  study  was  to  investigate  the  differences  in 
duration  between  vowels  preceding  voiced  and  voiceless  stops  and  fricatives  in 
the speech of three- and four-year-olds. Specifically, answers were sought
for the following questions:

1. Do young children systematically produce vowels of different durations
before word-final voicing oppositions?

2.  If  such  systematic  differences  occur,  are  they  similar  in  magnitude 
to  those  produced  by  adult  speakers? 

3.  Is  there  systematic  variation  in  the  duration  of  voicing  during 
consonant  closure? 

4.  How  well  do  vowel  duration  and  other  durational  factors  predict  the 
intended  voicing  characteristic  of  a word-final  consonant? 

5.  Are  there  age-related  differences  between  three-  and  four-year  olds 
in  the  above  measures? 


METHOD 


The  twenty  informants  were  nine  three-year  old  and  eleven  four-year  old 
children.  Seventeen  of  the  children  were  enrolled  in  a nursery  school  in  the 
Bronx,  New  York.  The  remaining  three  subjects  were  children  of  students  and 
of  faculty  members  of  Herbert  H.  Lehman  College  of  the  City  University  of  New 
York.  The  age  range  was  from  three  years  and  three  months  to  four  years  and 
eight  months. 

Seven  minimal  pairs  were  elicited  from  the  children  and  recorded  on  audio 
tape. The seven minimal pairs were: (1) rope-robe, (2) Bert-bird, (3) tight-tied,
(4) pick-pig, (5) peck-peg, (6) leaf-leave, and (7) loose-lose. All the
English  stop  contrasts  and  the  two  most  common  fricative  contrasts  are 
contained  in  the  list  of  utterances.  The  vowel  inventory  was  representative 
of  all  categories  of  tongue  height,  tongue  advancement  and  tongue  "tension." 

The  experimenters  attempted  to  elicit  each  word  without  prompting — that 
is,  without  saying  the  words  themselves — as  the  last  element  in  a syntagmatic 
response  frame  or  as  the  response  to  a picture  or  object  shown,  or  action 
performed.  Each  child  was  told  that  he  would  be  playing  a word-guessing  game 
and  was  given  six  practice  items  that  have  not  been  incorporated  into  the 
data.  Once  the  subject  seemed  to  understand  the  task,  the  elicitation  of  the 
minimal  pairs  listed  above  began.  With  the  exception  of  the  second  pair  on 
the list (Bert-bird), no minimal pair items were elicited consecutively.

If  a child  did  not  respond,  or  if  the  child's  response  was  not  the  target 
word,  then  the  experimenter  spoke  the  word  aloud  and  said  that  he  would  ask 
again  later  to  see  if  the  child  remembered  it.  The  child  did  fairly  often, 
and  the  word  was  recorded.  Such  responses  have  been  noted  as  delayed 
imitations,  since  the  word  was  not  said  by  the  experimenter  immediately  before 
the  child  spoke  it.  If  the  child  did  not  subsequently  remember  the  word,  then 
the  investigator  spoke  it  and  asked  the  child  to  repeat  it.  We  have  noted 
such  responses  as  immediate  imitations.  Often  the  word  was  elicited  still 
later  in  the  session,  and  thus  several  items  were  recorded  first  as  immediate 
imitations  and  then  as  delayed  imitations.  Any  "correct"  response  was 
immediately  followed  by  a request  for  repetition. 

Of the more than 600 tokens that were initially analyzed, 18 percent were
immediate  imitations,  21  percent  were  delayed  imitations,  and  61  percent  were 
unprompted  responses.  The  necessity  for  prompting  varied  greatly  from  word  to 
word.  For  example,  since  all  of  our  subjects  watch  Sesame  Street,  Bird  (from 
Big  Bird)  and  Bert  (from  Ernie  and  Bert)  never  had  to  be  prompted.  On  the 
other  hand,  very  few  of  the  subjects  were  acquainted  with  the  verbs  leave  and 
peck. No systematic differences have been found among the data derived from
the  unprompted  responses,  delayed  imitations,  or  immediate  imitations. 
Nevertheless,  all  of  the  immediate  imitations  have  been  eliminated  from  the 
data  except  for  those  which  were  the  only  instances  of  a response  type  for  a 
given  subject. 

Since  we  asked  for  an  immediate  repetition  of  any  desired  response,  most 
of the data are derived from two or more tokens of each response type per
child.  Those  data  derived  from  only  one  token  of  a response  type  generally 
resulted  from  the  elimination  of  other  tokens  because  of  extraneous  background 
noise  that  rendered  the  spectrograms  "unreadable,"  or  from  various  other 
causes  such  as  sudden  upward  leaps  in  fundamental  frequency  which  virtually 
transformed  spectrograms  from  wide  to  narrow  band  and  thus  removed  the  formant 
information  which  is  essential  for  segmenting  and  measuring. 

Wideband  spectrograms  were  made  of  each  utterance  using  a Kay  Sonagraph, 
model  6061B.  The  tapes  were  played  into  the  Sonagraph  at  half-speed  in  order 
to  lower  the  frequencies  of  the  children's  speech,  thus  making  the  presence, 
initiation,  and  termination  of  fundamental  frequency  more  easily  discernible. 
This  also  facilitated  the  durational  measurements  by  expanding  the  time  scale 
of  the  spectrograms. 


The following measures, estimated to the nearest 5 msec, were derived
from the spectrograms:

1. The duration of the vowel.

2. The duration of voicing during the final consonant closure.

3. The duration of the final consonant closure.

4. The total duration of the vowel plus the final consonant closure.

5. The syllable duration.

6. The total duration of voicing during the syllable.

7. The total duration of voicing during the vowel and final consonant
closure. In the case of the stop-initial syllables, this duration
was the same as measure 6.

We will not [...] and measuring durations of speech sounds in their spectrographic
representations. It is clear that vowel duration differences of the magnitude that we
have  most  often  found  are  easily  discernible  on  spectrograms.  Specifying 
durations,  however,  is  considerably  more  difficult  than  simply  noting  that 
they are there. This is because of the segmentation problem, and the degree
of difficulty depends to a great extent on the utterance in question. A pair
of the type tight-tied presents minimal difficulty: the burst releases for
the  stops  are  clear  indications  of  both  the  beginning  and  the  end  of  the 
syllable;  the  onset  and  offset  of  formants  and  of  fundamental  frequency  are 
generally  easily  visible  and  adequately  define  the  limits  of  the  vowel  and  of 
the  voicing  during  the  consonant  closure.  On  the  other  hand,  a pair  of  the 
type  leaf-leave  presents  some  serious  difficulties,  among  them  being  the 
segmentation  of  the  initial  consonant  from  the  vowel,  and  definition  of  the 
limits  of  the  final  consonant,  especially  its  termination  in  low-intensity 
noise.  When  uncertainty  ran  too  high,  utterances  were  simply  eliminated  from 
the corpus. Thus, 25 percent of the originally recorded and spectrographically
displayed tokens were not measured. Otherwise, correspondences were sought
between  all  the  tokens  of  both  members  of  a minimal  pair,  and  some  recurring 
acoustic  event  was  used  as  a landmark  for  segmentation.  For  example,  voicing 
during  final  consonant  closure  was  measured  up  to  the  first  break  in  the 
regular  pulsing  of  the  vocal  folds  as  delineated  by  the  low  frequency  vertical 
striations  on  the  wideband  spectrogram.  This  landmark  was  used  even  though 
there  were  many  examples  of  sporadic  voicing  after  the  break  and  occasionally 
throughout  the  final  consonant  closure.  Since  closure  voicing  tends  to 
decrease  over  time,  it  is  not  at  all  certain  whether  the  vocal  pulses  are 
audible,  even  before  the  voicing  break  used  as  a landmark.  If  any  of  the 
pulsing  is  audible,  however,  it  seems  reasonable  to  assume  that  it  will  occur 
before  the  transglottal  pressure  differential  has  diminished  sufficiently  to 
introduce  a hiatus  in  vocal  fold  vibration. 

RESULTS  AND  DISCUSSION 

The  average  vowel  durations  for  both  members  of  each  minimal  pair  are 
shown  in  Table  1.  In  95  percent  of  the  tokens,  the  vowels  are  of  greater 
duration  when  preceding  voiced  consonants  than  when  preceding  voiceless 
consonants.  Of  the  twenty  subjects,  fourteen  showed  no  reversals  of  vowel 
duration  for  any  of  the  oppositions.  That  is,  the  vowel  preceding  a voiceless 
consonant for these subjects was never as long as or longer than the one
preceding the voiced cognate consonant. Two of the utterance pairs (rope-robe
and pick-pig) showed no reversals for any subjects, and only one pair (peck-peg)
provided as many as two reversals. Of those six subjects who produced
vowels of equal or greater duration before voiceless consonants, only one did
so for two minimal pairs (Bert-bird and loose-lose), the others reversing in
one  pair  only. 

The  durational  differences  found  were  similar  in  magnitude  to  those 
produced  by  adults.  The  range  of  adult  vowel  duration  ratios  from  1.25:1  to 
2.3:1  contains  fully  90  percent  of  the  tokens  in  our  data.  Almost  all  those 
outside  this  range  were  greater  than  2.3:1. 

We  now  consider  the  other  durational  measures  and  how  well  they  predict 
the voicing of word-final consonants. Table 1 displays the averaged duration
of  closure  voicing  for  voiced  and  voiceless  consonants  for  each  minimal  pair. 
For  each  contrast  the  mean  duration  is  greater  for  the  voiced  consonants.  In 
90  percent  of  the  contrasting  tokens  the  closure  voicing  is  of  greater 
duration  for  final  voiced  consonants.  Of  132  contrasts,  there  are  only  13 
reversals  in  duration  of  closure  voicing. 
TABLE 1: Averaged durations and ratios of vowels and of voicing during
consonant closure for the members of each minimal pair.

                 Average Vowel                Average Closure
Minimal Pair     Duration (msec)    Ratio     Voicing Duration (msec)    Ratio

Rope                 129.7                          50.0
Robe                 209.4          1.6:1          112.5                 2.3:1

Bert                 227.9                          48.7
Bird                 321.5          1.4:1           80.8                 1.7:1

Tight                221.8                          30.5
Tied                 324.5          1.5:1           75.3                 2.5:1

Pick                  99.8                          20.3
Pig                  196.3          2.0:1           48.2                 2.4:1

Peck                 131.6                          24.5
Peg                  198.0          1.5:1           86.3                 3.5:1

Leaf                 150.8                          48.3
Leave                247.9          1.6:1           89.7                 1.9:1

Loose                136.3                          57.5
Lose                 231.8          1.7:1          121.8                 2.1:1


It  is  interesting  to  note  that  (in  Table  1)  the  ratios  of  the  closure 
voicing  durations  from  the  voiced  to  the  voiceless  contexts  are  consistently 
greater  than  those  of  the  averaged  vowel  durations  for  each  minimal  pair.  If 
we  assume  perceptual  significance  for  these  temporal  features,  then  the  effect 
of  the  difference  in  ratios  would  be  to  maximize  the  salience  of  the  cue 
(closure  voicing)  with  the  lesser  duration. 

Although  neither  vowel  duration  nor  closure  voicing  perfectly  predicted 
the intended voicing characteristic of the word-final consonants, the combination
of the two measures did. That is, the sum of the durations of vowel and
closure  voicing  for  each  subject  was  always  greater  for  the  member  of  a 
minimal  pair  ending  in  a voiced  consonant  than  for  the  member  ending  in  a 
voiceless  consonant. 


Figure 1 displays the magnitude of the averaged differences between the
durations of vowels, closure voicing and both measures taken together. The
averaged differences for the combined measures for each of the contrasts fall
between 124 and 160 msec. It is tempting on the basis of these data to
speculate that some small range of overall differences is being aimed at by
speakers. The range of overall differences for vowel plus closure voicing is
less than 36 msec. However, the way in which the total difference is divided
between vowel and closure voicing varies considerably from one utterance pair
to the next. In peck-peg, for example, the total difference is made up of
almost equal differences in vowel duration and closure voicing duration.
In pick-pig, on the other hand, more than three-fourths of the total difference
is supplied by the difference between the durations of the vowels.

Figure 1: Averaged durational differences between vowels, closure voicing, and
the sums of both measures.
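
The ratios and combined differences discussed here can be recomputed from the averages in Table 1. The following is a minimal sketch, assuming that differences of the tabled averages are an adequate stand-in for the averaged differences plotted in Figure 1; the names `TABLE_1` and `summarize` are illustrative and do not come from the original analysis.

```python
# Averages copied from Table 1, in msec. Each entry holds
# (vowel durations), (closure-voicing durations), both ordered as
# (voiceless member, voiced member) of the minimal pair.
TABLE_1 = {
    ("rope", "robe"):  ((129.7, 209.4), (50.0, 112.5)),
    ("Bert", "bird"):  ((227.9, 321.5), (48.7, 80.8)),
    ("tight", "tied"): ((221.8, 324.5), (30.5, 75.3)),
    ("pick", "pig"):   ((99.8, 196.3), (20.3, 48.2)),
    ("peck", "peg"):   ((131.6, 198.0), (24.5, 86.3)),
    ("leaf", "leave"): ((150.8, 247.9), (48.3, 89.7)),
    ("loose", "lose"): ((136.3, 231.8), (57.5, 121.8)),
}


def summarize(vowel, closure):
    """Return ratios, the combined difference, and the vowel's share of it."""
    vowel_diff = vowel[1] - vowel[0]
    closure_diff = closure[1] - closure[0]
    combined = vowel_diff + closure_diff
    return {
        "vowel_ratio": vowel[1] / vowel[0],
        "closure_ratio": closure[1] / closure[0],
        "combined_diff": combined,
        "vowel_share": vowel_diff / combined,
    }


summary = {pair: summarize(*values) for pair, values in TABLE_1.items()}

for (voiceless, voiced), s in summary.items():
    print(f"{voiceless}-{voiced}: vowel {s['vowel_ratio']:.1f}:1, "
          f"closure voicing {s['closure_ratio']:.1f}:1, "
          f"combined difference {s['combined_diff']:.1f} msec, "
          f"vowel share {s['vowel_share']:.0%}")

combined = [s["combined_diff"] for s in summary.values()]
print(f"spread of combined differences: {max(combined) - min(combined):.1f} msec")
```

Run on the tabled averages, the sketch reproduces the pattern described in the text: the closure-voicing ratio exceeds the vowel ratio in every pair, and the combined differences stay within a narrow band even though the vowel's share of the total varies from pair to pair.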

With one exception, none of the other spectrographic measurements predicted
the voicing characteristic of the final consonants as well as either
vowel  duration  or  closure  voicing  duration  (or  as  well  as  the  summed 
measures).  The  closure  duration  of  the  final  consonant  was  greater  for  the 
voiceless  cases  in  71  percent  of  the  tokens.  The  sum  of  vowel  duration  and 
final  consonant  closure  duration  was  greater  for  79  percent  of  the  syllables 
ending  in  voiced  consonants.  This  last  result  is  to  be  expected,  of  course, 
since  the  effectiveness  of  vowel  duration  as  a predictor  is  reduced  by  that  of 
closure  duration  which  was  most  often  greater  in  the  case  of  the  voiceless 
consonants.  The  total  syllable  duration  was  greater  for  the  voiced  context  in 
65  percent  of  the  tokens.  The  predictive  power  of  this  measure  suffers 
because  of  the  variability  of  the  durations  of  the  consonants,  especially 
those  in  syllable-initial  position  which  varied  asystematically  with  regard  to 
the  voicing  class  of  the  final  consonants. 

The exception mentioned above resides in the measure of total voicing
duration in the syllable, which predicted final consonant voicing characteristics
in 97 percent of the tokens. The strength of this measure exceeds all
but  that  of  the  perfect  predictive  power  of  the  summed  vowel  and  closure 
voicing  durations.  It  falls  short  of  perfection  because  of  the  adverse  effect 
of  the  asystematic  variation  in  the  duration  of  voicing  in  the  initial 
resonant  consonants.  In  the  cases  of  stop-initiated  syllable,  this  measure 
was  identical  with  that  of  the  summed  durations  of  vowel  and  closure  voicing. 

Finally,  we  consider  age-related  differences  in  the  productions  of  the 
three-  and  four-year  olds.  The  average  differences  between  durations  of 
vowels,  closure  voicing  and  the  two  measures  combined  are  shown  in  Table  2. 
Although  large,  the  age-related  differences  are  not  statistically  significant. 


TABLE 2: Average differences in msec for vowel duration, final-consonant
closure-voicing duration, and both measures combined. Differences
are also expressed as percentages of the combined measure.

                              Vowel    Closure    Vowel + Closure

Three-year olds    msec        83.2      36.8          120.0
                   percent     69.4      30.6          100.0

Four-year olds     msec        98.6      55.6          154.2
                   percent     63.9      36.1          100.0

All Subjects       msec        90.2      47.8          138.0
                   percent     65.4      34.6          100.0

By  age  three,  children  produce  differences  in  vowel  and  closure  voicing 
durations  before  voiced  and  voiceless  final  consonants  which  are  of  a 
magnitude  similar  to  that  found  in  adults'  productions.  Research  with  even 
younger  children  will  be  necessary  in  order  to  reveal  when  such  differences 
are first manifest and how they develop during the first years of language
use.

EMG Signal Processing for Speech Research
Diane Kewley-Port


ABSTRACT 

This  paper  describes  and  evaluates  the  EMG  signal  processing 
techniques  employed  at  Haskins  Laboratories.  EMG  signals  from  many 
electrodes  are  collected  simultaneously  on  a multichannel  tape 
recorder  along  with  the  audio  signal  from  a speaker.  Signals  are 
later  processed  in  two  stages.  The  first  includes  short-time  signal 
integration.  In  the  second  stage,  a computer  averages  a set  of  EMG 
signals  from  repetitions  of  an  utterance  carefully  aligned  to  the 
same  acoustic  event.  The  purpose  of  the  signal  processing  is  to 
produce an average EMG signal that is reliable, relatively noise-free
and easy to compare to the corresponding speech events. To
evaluate the signal processing, two experiments were conducted in
which EMG signals were collected simultaneously from several electrodes
placed in the same muscle. Several aspects of the sampling,
integration and averaging techniques are evaluated. Results indicate
that the signal processing succeeds in producing a reliable,
noise-free EMG signal for use in speech research.

INTRODUCTION 

Haskins  Laboratories  has  developed  an  electromyography  (EMG)  research 
facility  to  study  speech  production.  This  paper  describes  and  evaluates  the 
signal  processing  techniques  used  to  transform  the  small,  noisy  EMG  signal 
picked  up  at  the  electrodes  into  a reliable,  slowly  varying  signal  that  can  be 
easily  compared  to  speech  events. 

Signal  processing  of  the  raw  EMG  signals  as  designed  by  Cooper  (1965) 
involves  two  discrete  stages.  The  first  stage  is  a standard  EMG  processing 
procedure  of  integrating  the  rectified  EMG  signal.  However,  the  time  constant 
of integration is kept small in relation to the durations of articulatory
change. To eliminate the remaining noise, a set of integrated EMG signals
from repetitions of the same speech event is averaged together, producing an
average EMG signal.

Presumably,  the  average  EMG  signal  represents  the  signal  for  muscle 
contraction  from  a small  population  of  motor  units.  However,  in  computing  the 
average  EMG  signal,  the  randomness  inherent  in  the  raw  EMG  signal  is  reduced. 
In  this  sense,  the  average  EMG  signal  represents  some  underlying  regularity  in 
the  signals  causing  muscle  contraction  for  a given  speech  event,  which  we  call 
the  underlying  neuromotor  command. 

This  paper  is  devoted  to  describing  and  experimentally  evaluating  this 
two-stage  EMG  signal  processing  system.  These  analyses  are  undertaken  to 
provide  at  least  partial  answers  to  questions  commonly  raised  about  this 
processing  technique.  Of  special  interest  to  the  author  are  the  analyses 
attempting  to  demonstrate  that  average  EMG  signals  do  represent  the  underlying 
neuromotor  commands  to  a muscle. 

EMG  SIGNAL  PROCESSING 


Review  and  Description 

The EMG signal is initially detected by electrodes inserted directly in a
muscle. Waring (1974), in his review of the use of different electrode
technologies, stated that for the purpose of obtaining an estimate of general
muscle activity, the type of electrode and the exact specification of its
placement in the muscle are not of critical importance. The electrodes in
standard  use  consist  of  a pair  of  fine  hooked-wires  as  described  by  Basmajian 
and  Stecko  (1962)  and  modified  for  use  in  speech  research  by  Hirose  (1971). 
Hooked-wire  electrodes  have  been  chosen  for  research  of  the  speech  musculature 
because  there  is  minimal  discomfort  for  the  subject  with  the  resulting  speech 
being  "normal."  Analogous  considerations  have  made  similar  electrodes  an 
increasingly  popular  choice  in  a wide  range  of  applications  (Basmajian,  1974). 

The  electrical  signal  observed  at  the  electrodes  is  a bipolar  signal  that 
represents the summation of the muscle action potentials from a number of
motor units. The motor unit potentials are believed to arise within the
volume of fibers within 1 to 2 mm of the electrodes (Buchthal, 1959; Waring,
1974). The single motor unit potential has frequency components up to 2000 Hz
(Trimble, Zuber and Trimble, 1973) and a low amplitude, averaging around 500
microvolts, but ranging from 1 to over 1000 µV (Buchthal, 1959; Person and
Kudina,  1972;  Basmajian,  1974;  Waring,  1974).  In  summation,  the  potentials 
from  several  motor  units  firing  asynchronously  partially  cancel  one  another, 
producing  an  EMG  signal  referred  to  as  the  interference  pattern.  The  problem 
then  is  to  recover  from  this  small,  jittery,  noisy  EMG  signal,  a robust, 
reliable,  slowly  moving  signal. 

The solution adopted here was to pass EMG signals stored on a tape
recorder through two stages of signal processing (cf. Cooper, 1965). The
first  stage  is  a standard  procedure  of  amplifying,  rectifying  and  integrating 
the  EMG  signals.  These  integrated  EMG  signals  have  been  shown  by  various 
investigators  to  yield  simple  functional  relationships  with  measures  of  muscle 
tension and velocity of movement under certain experimental conditions. For
example,  it  was  found  that  integrated  EMG  is  linearly  related  to  muscle  force 
(tension)  for  isometric  contractions  (Inman,  Ralston,  Saunders,  Feinstein  and 
Wright,  1952;  Lippold,  1952),  for  isotonic  contractions  (Bigland  and  Lippold, 
1954),  for  constant  torques  about  the  elbow  (Leifer,  1969),  and  in  limited 
unidirectional movement (Bouisset and Goubel, 1971). (Also, see discussion in
Basmajian,  1974,  p.  168-172).  The  generality  of  these  findings  has  been 
challenged  by  Zuniga  and  Simons  (1969)  for  isometric  contractions  near  maximum 
effort,  and  by  Coggshall  and  Bekey  (1970)  for  isometric  contractions  where 
tension  was  dynamically  developed  and  tracked.  None  of  these  experiments 
corresponds  to  the  conditions  of  normal  muscular  activity  in  speech  where 
activity is clearly anisometric, and no specific relationship between integrated
EMG and articulatory movement or force has been proposed. However,
Bell-Berti and Hirose (1975) have shown in one study of velar closure that
the size of the increase in integrated EMG activity and the size of the change
in height of the velum were highly correlated [and Hirose (1976) has shown a
similar relationship for PCA activity and the size of the glottal chink].

Integration  alone  is  not  a suitable  technique  to  smooth  out  a noisy 
interference  pattern  for  comparison  to  articulatory  movement.  Significant 
changes  in  articulator  movement  may  occur  over  5-10  msec  for  rapidly  moving 
structures  (for  example,  the  tongue  tip)  or  50  msec  or  longer  for  slower 
structures (for example, the jaw). Integration time constants of 5 or 10 msec
produce a very noisy looking EMG signal (see Figure 4), and time constants
greater than 50 msec severely reduce information concerning signal onset or
peak height.

Cooper  (1965)  proposed  a solution  to  this  problem  by  providing  a second 
stage  of  signal  processing.  First,  EMG  signals  are  collected  from  many 
repetitions  of  the  same  utterance  and  individually  integrated  with  a minimal 
time  constant  (5  to  30  msec).  Then  the  digitally  sampled  EMG  signals  are 
computer  averaged  after  careful  time  alignment  of  the  repetitions. 

Model  for  Averaging  EMG  Signals 

The  rationale  behind  averaging  can  be  discussed  using  a simple  model  of 
the  sampled  integrated  EMG  signal.  Let  the  integrated  EMG  signal  after 
sampling  be  defined  as  the  sum  of  two  components  such  that  for: 

$t = t_k$ $(k = 1, 2, 3, \ldots, n)$, time sampled at instants $t_1, t_2, \ldots, t_n$

$i$ = ith repetition of an utterance

$p$ = pth placement of an electrode in a muscle

$$e_{pi}(t) = S_{pi}(t) + n_{pi}(t) \qquad (1)$$

where

$e_{pi}(t)$ is the waveform sampled by the computer

$S_{pi}(t)$ is the signal representing a continuous underlying motor command

We consider $S_{pi}(t)$ to be the envelope of the integrated EMG signal. The
additive noise, then, consists partly of oscillations about this envelope, due
to the limited number of motor units sampled and to the way in which the
continuous underlying command is transmitted by discrete motor-unit firings.
This component of the noise is defined as zero mean. Another component of
$n_{pi}(t)$ is electrical system noise, or the electrode noise sources discussed by
Waring (1974, p. 246-248). Due to rectification and integration, this component
has a nonzero DC value. However, this component is stationary, so we can
redefine the noise as zero mean by shifting the DC value of the signal.
This fact must be remembered when later interpreting the results: only
changes in signal level and not absolute levels are meaningful.


In our model, both the ensemble of signal components, $\{S_{pi}(t)\}$, and the
ensemble of noise components, $\{n_{pi}(t)\}$, are random processes. The main source
of randomness in $\{S_{pi}(t)\}$ is normal variation in the production of individual
repetitions of an utterance, that is, differences in timing and effort of
articulatory gestures. As we have defined $\{n_{pi}(t)\}$, it is zero mean over
intervals for which the signal is slow moving, although it is not stationary
or independent of the signal.


Returning to the model, let $E_p(t)$ stand for the average EMG signal at
electrode p, which is calculated as

$$E_p(t) = \frac{1}{m} \sum_{i=1}^{m} e_{pi}(t)$$

for m utterance repetitions and for all $t_k$ samples. Expanding equation (1)
for averaging,

$$E_p(t) = \frac{1}{m} \sum_{i=1}^{m} S_{pi}(t) + \frac{1}{m} \sum_{i=1}^{m} n_{pi}(t)$$

Obviously, for large enough m, the noise term vanishes so that,

$$E_p(t) \approx \frac{1}{m} \sum_{i=1}^{m} S_{pi}(t)$$

Success in eliminating the noise depends on a large enough m, which in
this case is the number of times a subject repeats an utterance. The
filtering effect of integrating $e_{pi}(t)$ using small time constants of 5 to 30
msec reduces the amplitude of the noise and therefore reduces the number of
repetitions needed.
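
The dependence on m can be illustrated numerically. The following is a minimal sketch, not part of the processing system described here, and it assumes an idealized case: the noise in each repetition is independent, zero-mean and Gaussian, which is stronger than the model above requires. Under that assumption the residual noise in the average falls roughly as one over the square root of m. The function names and parameter values are hypothetical.

```python
import math
import random


def rms(values):
    """Root-mean-square of a sequence of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def residual_noise_rms(m, n_samples=2000, noise_sd=1.0, seed=0):
    """RMS of the noise that remains after averaging m repetitions.

    Assumption: each repetition contributes independent zero-mean
    Gaussian noise. The signal component is left out because it is
    treated as common to all repetitions.
    """
    rng = random.Random(seed)
    averaged = []
    for _ in range(n_samples):
        total = sum(rng.gauss(0.0, noise_sd) for _ in range(m))
        averaged.append(total / m)
    return rms(averaged)


if __name__ == "__main__":
    # Print the residual noise for a few values of m.
    for m in (1, 4, 16):
        print(m, round(residual_noise_rms(m), 3))
```

With real EMG data the noise is neither stationary nor independent of the signal, so the gain from additional repetitions may be smaller than this idealized case suggests.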


Summarizing the two-stage signal processing used:

1.  EMG  signals  are  collected  from  many  repetitions  of  an  utterance  and 
recorded  on  a tape  recorder. 


2.  EMG  signals  are  amplified,  rectified,  integrated  with  a small  time 
constant,  and  stored  in  a computer.  This  integration,  while  reducing 
the  high  frequency  components  of  the  signal,  preserves  the  slow 
moving components of EMG activity comparable to those of articulatory
change.

3.  Integrated  EMG  signals  from  the  same  utterance  are  averaged  in  time. 
The  average  EMG  signal,  in  one  model,  represents  the  average  of  the 
potentials  of  a number  of  motor  units  firing. 

This  two-stage  process  produces  a good  estimate  of  the  envelope  of  the 
interference  pattern  without  excessive  time-smearing  or  too  many  repetitions. 

INSTRUMENTATION 

Data  Collection  and  Playback 

A block  diagram  of  the  recording  and  playback  equipment  is  presented  in 
Figure  1.  The  first  step  in  data  collection  is,  of  course,  the  placement  of 
electrodes  and,  occasionally,  pressure  transducers.  The  subject  is  made  as 
comfortable  as  possible,  and  each  channel  is  checked  for  correct  placement  and 
adjusted  for  maximum  gain  before  the  experiment  begins.  The  subject  is  then 
asked  to  read  (or  repeat)  the  desired  speech  material. 

The EMG signals collected by the bipolar wire electrodes go to differential
preamplifiers that have gains of 40 dB, noise levels (referred to the
inputs) of 5 µV RMS, and ca. 100 dB common mode rejection. From the
preamplifiers,  the  signals  go  to  distribution  amplifiers  with  adjustable  gains 
that  are  usually  set  at  about  30  dB.  These  amplifiers  include  80  Hz  high-pass 
filters  with  24  dB  roll-off  to  reject  movement  artifacts  and  hum.  The 
filtered  signals  are  then  recorded  on  a one-inch,  14-channel  instrumentation 
recorder  (Consolidated  Electrodynamics  VR-3300).  The  EMG  and  pressure  signals 
are  recorded  in  FM  mode  with  an  upper  frequency  limit  of  2300  Hz;  voice 
channels  and  timing  and  code  pulses  are  recorded  in  direct  mode. 

A calibration signal (300 µV ± 1%) referred to the pre-amp inputs is
occasionally  substituted  for  each  of  the  physiological  signals,  so  that  the 
signals  can  be  assigned  absolute  values  of  microvolts  for  EMG  channels  and 
centimeters  of  water  for  pressure  channels. 

A recording  channel  is  used  for  voice  signals,  both  for  the  subject's 
utterances,  and  in  order  to  take  note  of  events  and  changes  in  procedure 
during  the  course  of  the  experiment.  Two  other  channels  are  used  to  record  a 
clock track and a code and timing track. The former consists of pulses at a
rate of 3200 Hz; the latter, of pulses at a rate of 50 Hz, counted down from
the clock. A 4-digit octal timing code is periodically (usually once per
second) incremented and superimposed on the 50 Hz pulse train. (This way of
introducing  the  identification  codes  has  provided  a good  compromise  solution 
to  the  problem  of  making  the  oscillographic  record  easily  readable  by  humans 
and  the  tape-recorded  version  readable  by  computer.) 

Two  data  channels  can  be  used  for  pressure  recording.  The  differential 
pressure  transducers  used  are  SETRA  Systems  Model  2364.  The  sensitivity  range 
extends from 9 to 32 centimeters of water at 4° C, although the usual range is
from  9 to  16  cm.  water.  The  frequency  response  of  the  system  for  measuring 
pharyngeal  pressure  is  from  9 - 390  Hz;  the  bandwidth  of  the  system  for 
measuring  subglottal  pressure  is  somewhat  less.  Output  from  the  pressure 
transducers  goes  directly  to  dc  amplifiers  and  is  recorded  on  FM  channels  on 
the tape recorder. The 300 µV calibration signal is equivalent to 3
cm. water.

For  visual  inspection  of  the  recorded  signals,  the  EMG  data  channels, 
voice channel, and code and timing track are played back as input to an
18-channel Honeywell Visicorder. The recording speed of 7.5 inches per second is
doubled  on  playback  to  accelerate  processing.  Each  EMG  channel  is  routed 
again  through  the  distribution  amplifiers  and  high-pass  filters,  resulting  in 
an overall 80 to 1250 Hz frequency range. The pressure channels are not
filtered.

From  the  amplifiers  the  signals  are  full-wave  rectified  and  integrated  by 
linear-reset  integrators.  These  integrators  sum  linearly  over  a 5 msec 
interval, and are reset over 3 msec. They are sampled just before reset by a
16-channel multiplexer. All these operations are controlled by the clock
track.
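
The rectify, integrate, sample and reset cycle can be sketched in discrete time. This is only an illustration: the integrators described above are analog hardware, whereas the sketch assumes the raw signal is already available as a list of numbers at some higher, hypothetical sampling rate, and `block_len` (the number of raw samples per integration interval) is an assumed parameter.

```python
def linear_reset_integrate(samples, block_len):
    """Full-wave rectify, then sum over consecutive blocks.

    Each block stands for one integration interval. The running sum is
    read out at the end of the block and then reset to zero.
    A trailing partial block is dropped.
    """
    outputs = []
    n_blocks = len(samples) // block_len
    for b in range(n_blocks):
        acc = 0.0
        for x in samples[b * block_len:(b + 1) * block_len]:
            acc += abs(x)       # full-wave rectification
        outputs.append(acc)     # value sampled just before reset
    return outputs


if __name__ == "__main__":
    raw = [1, -1, 2, -2, 3, -3, 9]
    print(linear_reset_integrate(raw, 2))
```

Because every raw sample in an interval carries equal weight, each output value depends only on its own interval, unlike an RC integrator whose output also depends on earlier input.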

Computer  Processing 

The  basic  tasks  accomplished  by  the  computer  processing  system  are  to 
locate  each  utterance  on  the  instrumentation  tape,  sample  and  store  the  EMG 
signals  for  each  token,  and  then  to  align  repetitions  of  the  same  utterances 
in  time  and  calculate  the  average  EMG  signal. 

Analysis  of  an  experiment  begins  with  an  inspection  of  the  Visicorder 
traces.  An  utterance  is  located  with  respect  to  an  octal  code  recorded  on  the 
code  and  timing  track  trace.  Utterances  are  usually  averaged  with  respect  to 
a particular  acoustic  event,  such  as  the  beginning  of  vocal  cord  vibration, 
although,  of  course,  utterances  could  be  averaged  with  respect  to  another  type 
of  event,  such  as  the  onset  of  activity  in  a particular  muscle.  A temporal 
offset  interval  is  measured  from  the  beginning  of  code  to  the  line-up  point 
using  the  20  msec  timing  pulses  on  the  code  and  timing  trace.  Typically,  the 
offset interval is specified to the nearest 5 msec (a quarter of an interval
on the timing trace), which is within the inherent uncertainty (estimated at
ca. ± 10 msec) that is involved in locating a line-up point on the voice trace.
The  sampling  interval  is  also  5 msec. 
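
The timing arithmetic and the alignment step can be sketched as follows. The clock rate, timing-pulse rate and sampling interval are taken from the text; everything else (function names, the window arguments, the representation of a token as a list that starts at its code onset) is a hypothetical simplification and not the laboratory's actual program.

```python
CLOCK_HZ = 3200                      # clock track rate given in the text
TIMING_HZ = 50                       # timing pulses counted down from the clock
COUNTDOWN = CLOCK_HZ // TIMING_HZ    # clock pulses per timing pulse
PULSE_MS = 1000 // TIMING_HZ         # interval between timing pulses, in msec
SAMPLE_MS = 5                        # sampling interval given in the text


def pulses_to_ms(whole_pulses, quarters=0):
    """Offset read from the timing trace, to a quarter of an interval."""
    return whole_pulses * PULSE_MS + quarters * (PULSE_MS // 4)


def round_to_sample(offset_ms):
    """Round a non-negative offset in msec to the nearest sample time."""
    return int(offset_ms / SAMPLE_MS + 0.5) * SAMPLE_MS


def average_tokens(tokens, lineup_ms, pre_ms, post_ms):
    """Average tokens after aligning each at its own line-up point.

    tokens     : list of sample lists, each starting at its code onset
    lineup_ms  : offset from code onset to line-up point, one per token
    pre_ms     : window kept before the line-up point
    post_ms    : window kept from the line-up point onward
    """
    pre = pre_ms // SAMPLE_MS
    post = post_ms // SAMPLE_MS
    windows = []
    for samples, offset in zip(tokens, lineup_ms):
        centre = round_to_sample(offset) // SAMPLE_MS
        if centre - pre < 0 or centre + post > len(samples):
            raise ValueError("window extends beyond the sampled token")
        windows.append(samples[centre - pre:centre + post])
    m = len(windows)
    # Average sample by sample across the aligned windows.
    return [sum(column) / m for column in zip(*windows)]
```

In this sketch two tokens whose activity peaks fall at different times after their codes still superimpose exactly once each is windowed around its own line-up point.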

Thus, two descriptors specify an utterance, its CODE and LINE-UP INTERVAL.
A sample window of two seconds or less is specified for each utterance
relative  to  the  line-up  point.  A list  of  the  CODES  and  LINE-UP  INTERVALS  is 
prepared  for  each  utterance  type  to  be  averaged  together  by  the  computer.  Two 
alternative  procedures,  with  equivalent  results,  are  used  for  making  these 
measurements.  Either  they  can  be  made  by  hand,  and  later  typed  into  a 
computer,  or  they  can  be  made  directly  on  a digitizing  tablet. 


A brief description of the computer configuration and the EMG programs
provides an overview of data processing. (Details are available in Kewley-Port,
1973.) The computer is a Honeywell DDP-224 with 24 bits/word and a 32K
memory. Essential storage devices to permit one-pass sampling of the EMG
signals include four 2.4 million word disk units and two high speed magnetic
tape units. Programs are supervised by a time-sharing monitor with communication
through one of three alphanumeric CRT terminals. Output devices include
a line-printer  and  a 12  inch  storage  display  scope  attached  to  a hard  copy 
unit. 

For  purposes  of  processing,  an  EMG  experiment  is  defined  by  the  fixed 
storage  capabilities  of  the  computer  programs.  In  a single  pass  of  the  EMG 
analog  tape,  8 channels  of  EMG  or  pressure  data  can  be  simultaneously  sampled 
and  stored  for  up  to  30  utterance  types,  with  up  to  30  tokens  of  each  type. 
All  the  information  for  one  experiment  is  stored  on  one  digital  mag  tape.  For 
the  purposes  of  computer  processing,  the  pressure  channels  are  treated  like 
the  EMG  channels. 

Nine  computer  programs  are  central  to  the  EMG  data  processing  system. 
They  are  listed  in  order  of  their  use  with  a brief  task  description. 

E$MGESEL:  All  control  information  is  input  and  stored  on  a mag  tape. 

E$MGSCAN: EMG signals are sampled as specified by the control information.
Information is displayed on the storage scope to enable the operator to
optimize playback levels of each channel for the dynamic range of the A/D
converters and check for gross clerical errors in the control information.

E$MGSTOR: All EMG signals are sampled with 12 bit precision, stored on
disk and then transferred to the mag tape.

E$MGSUMS: The sums and sums of squares of the tokens for each utterance
are calculated and stored on mag tapes. Only 7 of the 12 bits of data stored
are summed. Further digital smoothing of the EMG signals before averaging can
be specified in 5 msec increments, called the time constant of integration.
The smoothing function is triangular.

E$MGPAGE: All of the individual EMG signals used in calculating an
average can be displayed on the storage scope for visual inspection and
editing by the experimenter (see Figure 4 for a sample display). The
experimenter is usually looking for two types of errors: an EMG signal may
appear obviously shifted in time relative to the line-up point for some CODE,
in which case the ESEL list can be checked for clerical errors; occasionally
very large spikes may be present in the EMG signals that are apparently due to
shorting of the electrode tips in the muscle during recording. At the end of
the visual editing process, E$MGPAGE automatically constructs a corrected
digital mag tape that is then used for all output programs.

E$MGPRNT: The averages and the standard deviations for all utterance
types are printed on a line printer from the mag tape. The output is in
microvolts using a conversion factor calculated from the 300 microvolt
calibration signals that have also been sampled for each channel. For the
pressure channels the output values are in cm. water.

E$MGRAPH: The task of rapid and easy retrieval, visual comparison and
manipulation of average EMG signals from up to 40 EMG experiments is done by
E$MGRAPH using the storage scope. Up to nine EMG signals may be displayed
using three distinctive line types on three sets of axes. The storage scope
display can be duplicated on paper by a hard copy unit. (See Figure 5 for a
sample display.)

E$MGCORR: A correlation coefficient, r, and its Fisher Z-Transform are
calculated and printed for any two EMG averages and/or individual EMG signals
on a single EMG experiment. The Fisher Z-Transform is useful in calculating
certain statistics on r, especially computing its average (see McNemar, 1969,
p. 157-158).
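
The correlation statistics named in the last entry can be sketched in a few lines. This is an illustrative implementation of the standard Pearson coefficient and Fisher transform, not a transcription of the program itself; the function names are hypothetical. The Fisher transform is z = atanh(r), and an average r is obtained by averaging the z values and transforming back with tanh.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_z(r):
    """Fisher z-transform of a correlation coefficient (|r| < 1)."""
    return math.atanh(r)


def average_r(rs):
    """Average several correlation coefficients through their z values."""
    zs = [fisher_z(r) for r in rs]
    return math.tanh(sum(zs) / len(zs))
```

Averaging through z gives more weight to high correlations than a plain arithmetic mean of r does, because the distribution of r is compressed near 1.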


EVALUATION  EXPERIMENTS 

In  order  to  evaluate  the  assumptions  underlying  this  data  processing 
technique  and  to  make  some  other  decisions  about  the  details  of  the  analysis 
procedures,  two  experiments  were  conducted.  Using  a different  muscle  for  each 
experiment,  EMG  signals  were  collected  simultaneously  from  several  electrodes 
placed  bilaterally  in  the  muscle.  An  attempt  was  made  to  select  muscles  that 
would  contract  uniformly  over  their  length,  using  prior  experience  and 
anatomical  evidence  as  guides;  as  we  will  see  below,  one  of  the  chosen  muscles 
did  not  fulfill  these  conditions. 

The  subject  for  both  experiments  was  a female  native  American  English 
speaker  (the  author).  The  basic  procedures  outlined  earlier  for  collecting 
and  processing  EMG  signals  from  hooked-wire  electrodes  were  followed.  Any 
special  analyses  other  than  the  standard  integration  and  averaging  techniques 
are  described  below. 

In Experiment I, 6 electrodes were inserted bilaterally into the levator
"dimple" on the oral surface of the soft palate and one into the orbicularis
oris (OO) for reference. After inspection of the processed EMG signal, two
recordings were eliminated from the analyses, one because the signal level was
very low (below 50 µV), and the other because it contained excessive numbers
of erratic, high amplitude spikes thought to arise from the electrodes
touching. The remaining channels were two channels from the right levator
(LPR2 and LPR3) and two from the left (LPL6 and LPL7). The speech material
read 14 times consisted of 4 nonsense utterances, /fimpip/, /fipmip/,
/fintip/, /fitnip/ and one anomalous sentence, "Jean Teacup’s nap is a snap."

In Experiment II, 6 electrodes were inserted through the cutaneous tissue
under the chin bilaterally into the mylohyoid (MY), and one into the
orbicularis oris (OO). In this experiment, two MY channels were also
eliminated from analysis, both because of very low amplitude signals. The
remaining channels were two from the right MY (MYR6 and MYR7) and two from the
left (MYL3 and MYL5). The speech material consisted of a text loaded with
phonemes for which the mylohyoid was thought to be active, that is, /i/ and
apical and velar consonants. The subject read the text, followed by a list of
13  short  phrases  excerpted  from  the  text,  14  times.  Ten  2-second  utterances 
from  the  text  and  all  the  phrases  were  selected  for  computer  processing. 


EVALUATION  OF  SAMPLING  AND  INTEGRATION  TECHNIQUES 


Number  of  Bits  Sampled 

Although  12  bits  of  data  are  sampled  and  stored  on  playback,  it  was 
estimated  that  only  7 bits  of  data  would  be  reliably  sampled  because  the  tape 
recorder  playback  amplifiers  are  limited  to  approximately  40  dB  signal-to- 
noise  ratio.  To  verify  this  estimate,  a single  utterance  from  Experiment  I 
was  sampled  and  processed  with  no  digital  smoothing  twice  in  the  same  day. 
Fifty  continuous  samples  were  compared  for  the  two  passes  on  all  seven 
channels.  The  6 most  significant  bits  were  identical  so  the  number  of  times 
that the seventh bit differed was counted. On average, this bit differed
in 1 of 5 comparisons across all channels. This is equivalent to an average of
6.8  bits  reliably  sampled,  with  a range  of  6.7  to  6.9  bits  across  channels. 
Thus  the  use  of  7 (of  the  12)  bits  sampled  in  calculating  the  digital 
smoothing  and  averaging  of  the  EMG  signals  was  confirmed  as  being  significant. 
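
Both figures in this paragraph can be checked with simple arithmetic. The sketch below rests on two assumptions that the text does not state explicitly: that the 40 dB figure is an amplitude ratio, so that each bit corresponds to 20 log10(2), about 6.02 dB; and that the 6.8-bit figure was obtained by adding to the six identical bits the fraction of comparisons in which the seventh bit agreed. The function names are hypothetical.

```python
import math


def bits_from_snr_db(snr_db):
    """Number of bits whose amplitude range spans the given SNR.

    Assumption: the SNR is an amplitude ratio, so one bit is worth
    20 * log10(2) dB.
    """
    return snr_db / (20 * math.log10(2))


def reliable_bits(identical_bits, next_bit_error_rate):
    """Identical bits plus the fraction of agreement on the next bit.

    Assumption: this is one reading that reproduces the figure quoted
    in the text; the original calculation is not spelled out there.
    """
    return identical_bits + (1 - next_bit_error_rate)


if __name__ == "__main__":
    print(round(bits_from_snr_db(40), 2))
    print(round(reliable_bits(6, 1 / 5), 2))
```

On these assumptions the prior estimate from the playback amplifiers and the measured agreement between passes both point to just under 7 usable bits.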

Linear  Versus  RC  Integrators 

Another  purpose  of  these  experiments  was  to  compare  the  performance  of 
linear-reset integrators with the RC integrators we formerly used in EMG
experiments. These RC integrators are, we believe, quite similar to those
used by several other investigators. We expected to find a superiority of the
linear-reset integrators for several reasons. The RC integrators must be set
to a time constant that causes a fixed time lag in the EMG signals relative to
other physiological events. For the linear-reset integrators, since the
digital integration is both backwards and forwards, there should be no shift
of the EMG signals in time. Furthermore, the RC integrators have a filter
characteristic that does not provide as good signal integration as the linear
integrators, which theoretically provide "true" time integration. The superiority
of linear integration for EMG signals in a feedback control application
was demonstrated by Kreifeldt (1971) for isotonic contractions in human
subjects.

An informal comparison of the output of the two kinds of integrators
confirmed the expected superiority of the linear integrators. All EMG signals
from both Experiment I and II were processed through both types of integrators.
The time constant for digital smoothing was 25 msec in Experiment I,
and 35 msec in Experiment II, chosen according to the criteria discussed
below. Visual inspection of the averaged EMG curves for the same utterances
shows a distinct time lag for the RC integrator signals. Using computer-driven
graph pen facilities in E$MGRAPH, the lag was estimated for each of 16
measurements, 11 from Experiment I and 5 from Experiment II. The average lag
was calculated as 24 msec in rising portions of the curves, and 21 msec in
falling portions. The definition of the EMG curves was sharpened for the
linear integrator processing. To estimate the increase in peak height, the
difference in microvolts between the peaks of the linear-versus-RC integrator
signals was measured for the same 16 curves mentioned above. Calculating the
increase as percent of this difference divided by the peak value of the RC
curve, the average peak height increase for the linear integrators was 21
percent.
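
The origin of the lag and of the reduced peak height can be illustrated with a toy comparison. The sketch below is not a model of the actual hardware: it assumes a causal one-pole low-pass filter as a stand-in for an RC integrator and a centred moving average as a stand-in for symmetric (backwards and forwards) smoothing, and it applies both to an artificial triangular bump. All names and parameter values are hypothetical.

```python
import math


def rc_filter(x, tau_samples):
    """Causal one-pole low-pass filter (a simple RC-style integrator)."""
    alpha = 1.0 - math.exp(-1.0 / tau_samples)
    y = 0.0
    out = []
    for v in x:
        y += alpha * (v - y)    # output moves a fraction of the way to input
        out.append(y)
    return out


def centred_average(x, width):
    """Symmetric moving average; width should be odd."""
    half = width // 2
    out = []
    for k in range(len(x)):
        lo = max(0, k - half)
        hi = min(len(x), k + half + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out


def peak_index(x):
    """Index of the first maximum of a sequence."""
    return max(range(len(x)), key=lambda k: x[k])


if __name__ == "__main__":
    # Artificial triangular bump with its peak at sample 50.
    bump = [max(0.0, 20.0 - abs(k - 50)) for k in range(101)]
    rc = rc_filter(bump, 5)
    sym = centred_average(bump, 5)
    print("peak sample, RC filter:     ", peak_index(rc))
    print("peak sample, centred window:", peak_index(sym))
```

In this toy case the centred window leaves the peak where it was, while the causal filter both delays and lowers it, which is the qualitative pattern reported above.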

Time Constant of Digital Integration

The  time  constant  for  digital  integration  is  not  preset,  but  chosen 
separately  for  each  experiment.  We  analyzed  some  of  the  data  from  Experiment 
II  to  see  if  a time  constant  could  be  selected,  based  on  the  criterion  of  the 
minimum time constant necessary for effective smoothing. A number of variables
were considered likely to influence the value in different experiments.
As mentioned previously, the number of repetitions in the average is an
important variable, although the maximum value is set between 15 and 20 due to
the limits of endurance of subjects (and experimenters). Other variables to
be considered are signal level, and type and length of the speech material.
It was apparent from visual inspection of the EMG signals using E$MGPAGE that
more smoothing was needed for signals with higher levels of activity. Thus,
two of the mylohyoid electrodes (MYL3 and MYR6) with signal peaks around 500
to 600 µV were chosen. Two of the 2-sec sentences and two of the 1-sec
phrases for which the mylohyoid showed activity throughout the utterances
were  selected  for  analysis.  There  were  14  repetitions  in  each  average. 

The  procedure  was  to  calculate  the  average  several  times  using  different 
time  constants.  The  time  constants  selected  were  5 msec  (unsmoothed),  15,  25, 
35,  45,  55,  and  95  msec.  For  each  time  constant,  electrode  channel  and 
utterance,  a correlation  analysis  was  made.  The  correlation  coefficient,  r, 
and  its  Fisher  z-transform,  z,  were  calculated  between  an  EMG  average  and  an 
individual  signal  from  the  average.  As  the  time  constant  increases,  r 
increases  because  both  signals  have  less  ripple,  that  is,  less  uncorrelated 
noise.  To  obtain  a function  representing  this  increase  in  r,  ten  individual 
EMG  signals  were  correlated  with  their  mutual  average  for  each  time  constant, 
and  an  average  of  these  correlation  coefficients  was  computed  using  the  z 
values.  The  increase  in  r represents  the  extent  to  which  the  individual  EMG 
signals  making  up  an  average  are  becoming  more  like  the  average  as  the  time 
constant  is  increased. 
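
The averaging of correlation coefficients through their z values, as described above, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program used in the study: the function names and the toy signals are hypothetical, and each signal is assumed to be an already smoothed list of samples of equal length.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mean_r_with_average(signals):
    """Correlate each signal with the mutual average, then average the r's.

    The r's are averaged in the Fisher z domain (z = atanh(r)) and the mean z
    is converted back to r. Assumes no signal correlates perfectly (|r| = 1)
    with the average, since atanh is undefined there.
    """
    n = len(signals)
    length = len(signals[0])
    average = [sum(s[t] for s in signals) / n for t in range(length)]
    z_values = [math.atanh(pearson_r(s, average)) for s in signals]
    return math.tanh(sum(z_values) / n)

# Hypothetical toy data: three short "repetitions" of one utterance.
toy_signals = [
    [0, 1, 2, 3, 2, 1],
    [0, 2, 1, 3, 2, 0],
    [1, 1, 3, 2, 2, 1],
]
toy_mean_r = mean_r_with_average(toy_signals)
```

Repeating this calculation after smoothing the same signals with each candidate time constant would give one point per time constant on a curve of the kind the text describes.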

Figure 2 shows the average r's as a function of the time constant for
electrode MYL3; on MYR6 the correlation functions for the four utterances did
not overlap as extensively as those of MYL3, but otherwise there was little
difference and they are not presented. Average r increased most rapidly for
time constants less than 25 msec. The highest value of average r occurred for
different sentences for the two electrodes, but a difference in average r
functions between sentences and phrases was not observed.

To determine the effectiveness of increased digital smoothing, a function
showing the average improvement of r with increased smoothing was calculated.
Using the z values, the average z's were calculated for 10 msec increments in
time constant. Successive 10 msec pairs were subtracted. The z differences
obtained for each pair did not differ for sentences versus phrases, or across
channels MYL3 and MYR6. Thus a grand average of z differences was calculated,
converted back to r values, and plotted in Figure 3.

The curve in Figure 3 appears to have an elbow at about 35 msec, such that
some improvement occurs with time constants up to 35 msec, but little
improvement for greater time constants. Thus for these EMG signals, a time
constant of no more than 35 msec should be chosen; larger time constants of
digital integration will probably not remove substantially more ripple from
the average signal, but will certainly reduce the time resolution of the peaks
and valleys of the signal.

Figure 2: Average values of the correlation coefficients computed between ten
individual utterances and their mutual average as a function of
increasing the time constant of digital smoothing.

EVALUATION  OF  AVERAGING  TECHNIQUE 

Comparison of Average EMG Signals from Different Electrode Placements

The purpose of calculating an averaged EMG signal for an utterance is to
obtain a reliable and relatively noise-free EMG signal that we believe is
representative of the underlying neuromotor command to the muscle. Using the
model previously described (beginning p. 3), we present an analysis that
tests this claim when EMG signals are observed simultaneously from
several electrodes in the same muscle.

Writing  equation  (3)  again,  the  average  EMG  signal  for  a specific 
electrode  is: 

E_p(t) \approx \frac{1}{m} \sum_{i=1}^{m} s_{pi}(t)    (3)


Let's  consider  the  case  where  a muscle  contracts  uniformly  over  its  length. 
In  this  case,  the  motor  units  should  fire  similarly  throughout  the  muscle 
(within  the  normal  stochastic  limits),  and  the  mean  value  function  represented 
by  Ep(t)  should  be  the  same  for  any  electrode  placement  p.  Thus  we  can  define 
S(t)  such  that: 


S(t) = E_p(t) \approx \frac{1}{m} \sum_{i=1}^{m} s_{pi}(t)    (4)

for all electrode placements p in a uniformly acting muscle. We can think of
S(t) as an EMG representation of the underlying neuromotor command for this
utterance.


It might appear that the most obvious way to test whether
Ea(t) = Eb(t) = S(t) for electrodes a and b would be to subtract Eb(t) from
Ea(t) and see if the resultant function is zero valued. However, amplitude
values for any electrode depend in large part on the number and proximity of
the motor units firing to the electrode. Rather than try to develop an
appropriate normalization technique for signal amplitude, it was decided to
use the correlation coefficient, r, which is sensitive to signal variation
but not to absolute amplitude, to test the hypothesis. Two analyses are
presented, one for correlations between separate EMG electrode averages, and
the other for correlations between two groups of repetitions from the same
electrode.

To see whether Ep(t) is independent of electrode placement, the correlation
coefficient, rab, is calculated between the averages from two different
electrodes, a and b. The correlation coefficient for EMG averages Ea(t) and
Eb(t) is defined as:

r_{ab} = \frac{\mathrm{Cov}[E_a(t), E_b(t)]}{\sqrt{V[E_a(t)] \, V[E_b(t)]}}    (5)

where Cov stands for covariance and V for variance. But if our hypothesis
that Ea(t) = Eb(t) = S(t) is correct, then

r_{ab} = \frac{\mathrm{Cov}[S(t), S(t)]}{V[S(t)]} = 1
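
The reason for preferring r to a direct subtraction can be illustrated with a small Python sketch. The two curves below are invented for illustration (they are not data from the experiments): the second has the same time course as the first but a different gain and baseline, as might happen when one electrode lies closer to the active motor units.

```python
import math

def pearson_r(x, y):
    """Correlation coefficient: Cov[x, y] / sqrt(V[x] * V[y])."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    return cov / math.sqrt(var_x * var_y)

# Hypothetical averaged EMG curves (microvolts) from electrodes a and b.
e_a = [10.0, 40.0, 90.0, 60.0, 20.0]
e_b = [3.0 * v + 5.0 for v in e_a]  # same shape, different gain and offset

difference = [a - b for a, b in zip(e_a, e_b)]  # far from zero everywhere
r_ab = pearson_r(e_a, e_b)                      # unaffected by gain or offset
```

The subtraction test would reject the hypothesis of a common S(t) for these two curves, while the correlation coefficient treats them as identical in time course.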


Results  from  Experiment  I 

Correlation coefficients were calculated for average EMG signals between
all possible pairs of the four electrodes placed in the levator palatini in
Experiment I. Table 1 presents the coefficients for all 5 utterances. The
coefficients have been grouped according to electrodes that were in the same
side of the levator, 2 & 3 and 6 & 7, or different sides, 2 & 6, 2 & 7, 3 & 6,
3 & 7. The average value of the coefficients was .95.


TABLE 1: Experiment I: Correlation Coefficients for Average EMG Signals for
Levator Palatini

Electrode
Placement         /fimpip/   /fipmip/   /fintip/   /fitnip/   Sentence

Same Side
  2,3               .93        .96        .96        .95        .95
  6,7               .99        .99        .99        .99        .98

Different Side
  2,6               .86        .94        .93        .94        .96
  2,7               .86        .93        .94        .92        .93
  3,6               .94        .96        .95        .94        .94
  3,7               .95        .97        .95        .95        .93

The square of the average correlation coefficient, r² = .90, can be
interpreted to mean that 90 percent of the variance of Ep(t) from one
electrode is predictable from the variance from another electrode. (Note that
variance here refers to the amplitude distribution of one signal about its
mean value calculated over the time period sampled.) That is, the hypothesized
S(t) signal could account for 90 percent of the variance of the EMG averages
over time and other, unknown sources account for the remaining 10 percent of
the variance.

One source of this variance might be that the levator is functionally
differentiated and there is some real variation in the neuromotor commands to
the muscle as observed by the different electrodes. In this regard, note that
the two electrode pairs on the same side of the levator had a higher average
correlation, .98, than did the four electrode pairs in the different sides,
.94. These values are significantly different by a t-test, t = 4.26,
p < .001. This suggests that there was some real variation in the average EMG
signals between electrodes, since the results are compatible with the
expectation that there would be more variation between the bilateral muscle
pairs than within the same muscle.

Results  from  Experiment  II 

The results from the mylohyoid muscle observed in Experiment II did not
simply replicate Experiment I. It was obvious from visual inspection of the
EMG signals that there was substantial variation between the four electrode
placements. There appeared to be a functional differentiation of EMG activity
for the electrode placements going from the chin posteriorly for speech
segments including high front vowels and apical consonants, but uniform EMG
activity for the velar consonants. Although this unanticipated finding should
be confirmed, it is anatomically reasonable, since the fibers are spatially
distributed in their origin and insertion. Correlation coefficients were
calculated for the four pairs of mylohyoid electrodes and the results can be
interpreted as supporting the above hypothesis. The average of the
correlation coefficients for four sentences for adjacent anterior-posterior
electrode placements was .92. The average for the extreme anterior-posterior
pair for the same sentences was .72.

Further  Analysis 


The above results from Experiments I and II raise the question: Can a
high correlation between two separate electrode signals determine whether or
not both signals represent the same underlying neuromotor command? For
example, is the average r across all 4 levator electrodes of .95 high enough
when there is still 10 percent unaccounted variance between electrodes?

A separate analysis was developed to determine whether two average
signals from electrodes a and b were significantly different from one another.
The individual EMG signals, epi(t), are arbitrarily assigned to two groups of
equal size (m/2) designated "o" for odd and "e" for even. New averages for each
group are calculated, Eao(t), Eae(t), Ebo(t) and Ebe(t). Correlation
coefficients are calculated between all pairs of the averages. The correlation
coefficients for the odd and even groups from the same electrode are compared
by means of a t-test to the correlation coefficients for groups from the
different electrodes. If the correlation coefficients do not differ
significantly from each other, we may say that the signal variance from one
group for electrode a may be predicted equally from the other group on
electrode a, or from either group on electrode b. Thus we infer that a single
average neuromotor command, S(t), was sampled at both electrode placements.
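
The odd-even analysis just described can be sketched as follows. This is a minimal illustration, not the original analysis program: the function names are hypothetical, the split takes alternate repetitions, and the t-test is assumed to be an ordinary two-sample test with pooled variance applied to the z-transformed coefficients (two same-electrode values against two different-electrode values, giving 2 degrees of freedom).

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average(signals):
    """Point-by-point mean of several equal-length signals."""
    n = len(signals)
    return [sum(s[t] for s in signals) / n for t in range(len(signals[0]))]

def pooled_t(sample1, sample2):
    """Two-sample t statistic with pooled variance (assumed form of the test)."""
    n1, n2 = len(sample1), len(sample2)
    m1, m2 = sum(sample1) / n1, sum(sample2) / n2
    ss1 = sum((v - m1) ** 2 for v in sample1)
    ss2 = sum((v - m2) ** 2 for v in sample2)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

def odd_even_t(reps_a, reps_b):
    """t statistic comparing same-electrode with different-electrode r's.

    reps_a and reps_b hold the individual signals from electrodes a and b,
    in the same repetition order.
    """
    a_odd, a_even = average(reps_a[0::2]), average(reps_a[1::2])
    b_odd, b_even = average(reps_b[0::2]), average(reps_b[1::2])
    same = [math.atanh(pearson_r(a_even, a_odd)),
            math.atanh(pearson_r(b_even, b_odd))]
    different = [math.atanh(pearson_r(a_even, b_odd)),
                 math.atanh(pearson_r(a_odd, b_even))]
    return pooled_t(same, different)
```

A large positive t indicates that the two halves of one electrode's data resemble each other more than they resemble the other electrode's data. With only two coefficients per group the test is coarse, and it breaks down when the coefficients within a group are identical.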

The EMG signals for the anomalous sentence from Experiment I were
analyzed for all four electrode placements in the levator palatini, labeled
here as 2, 3, 6, and 7. There were 10 repetitions of the sentence in the EMG
averages, each 2 sec long. Figure 4 presents the EMG averages and 4 of the 10
utterance repetitions for electrode 6 (LPL6).


The 10 utterance repetitions were split into odd and even lists on each
channel and new averages were calculated. The correlation coefficients for
the averages from the pairs of odd and even channels were calculated and
appear in Table 2. T-tests were calculated between the means of the
correlation coefficients (using the z-transform) for EMG averages from the same
electrode, and the averages from the different electrodes. (For example, a
t-test was calculated between the mean from the same electrodes,
1/2 [r(2e, 2o) + r(3e, 3o)], and the mean from different electrodes,
1/2 [r(2e, 3o) + r(2o, 3e)], where e = even and o = odd.) None of the t-tests
was significantly different at better than a .25 level (two-tailed test).

TABLE 2: Experiment I: Correlation Coefficients Between the Averages for the
Odd and Even Lists for Pairs of Electrodes 2, 3, 6, and 7.

        2     3              6     7              3     7
  2    .92   .91        6   .92   .89        3   .89   .79
  3    .86   .89        7   .92   .91        7   .92   .91

        3     6              2     7              2     6
  3    .89   .82        2   .92   .84        2   .92   .88
  6    .92   .92        7   .89   .91        6   .89   .91


The EMG signals from one sentence from Experiment II were analyzed in the
same way as the sentence from Experiment I. The "sentence" was the first 2
sec from the sentence beginning "Eve and Clayton left Kansas for the ..." The
line-up point was the onset of voicing in "Eve." There were 14 repetitions of
this sentence. The four electrode placements in the mylohyoid are labeled 3,
5, 6 and 7. As mentioned in discussion of the averaging technique,
correlations of EMG averages between pairs of electrodes in Experiment II
were not in general high. This sentence was chosen because the coefficients,
as seen in Table 3, were higher than most others, probably because there were
more velar consonants. However, pairs 3-6 and 3-7 had such low coefficients,
.77 and .70 respectively, that they were dropped from further analysis.

Figure 4: EMG signals from LPL6 for the sentence "Jean Teacup's nap is a
snap." The average signal, repeated at the top, is calculated from
10 repetitions using a 25 msec time constant. The line-up point is
the offset of voicing in Jean. On the left, sentence repetitions 5
through 8 have no digital smoothing. On the right, repetitions 5
through 8 have a 25 msec time constant.


TABLE 3: Experiment II: Correlation Coefficients Between the Average EMG
Signals for Pairs of Electrodes 3, 5, 6, and 7.

Electrodes     5      6      7

    3         .87    .77    .70
    5                .92    .89
    6                       .97


The utterances were divided into odd and even lists and the appropriate
EMG averages calculated for the remaining electrode pairs. Correlation
coefficients between these averages are presented in Table 4. Calculations of
t-tests between electrode pairs showed that two pairs of EMG averages were
significantly different from one another; t(3-5) = 8.49, p < .02, and
t(5-7) = 5.70, p < .05. For the other two pairs, t-tests showed them to be
significantly different at the .05 < p < .1 level; t(6-7) = 3.30 and
t(5-6) = 3.56. However, both of these calculations are based on correlation
coefficients with four significant digits because the rounded two digit
entries in Table 4 were identical. It seems that the t-test is not valid in
this case since the population variances of the same and different pairs are
most likely unequal (see McNemar, 1969, p. 117). Intuitively, it seems that
pair 6-7 with an r = .97 between the average signals should not be
significantly different using this odd-even analysis, contrary to the above
which showed a difference at the .1 level. Therefore, if the arbitrary
division of EMG signals (here as odd or even) results in identical r's for
either the same or different sides, the t-test should not be used, and perhaps
a different division of the lists would result in unequal r's. Unfortunately
an analysis of pairs 6-7 and 5-6 was not carried out for a different grouping.


TABLE 4: Experiment II: Correlation Coefficients Between the Averages for
the Odd and Even Lists for Pairs of Electrodes 3, 5, 6, and 7.

        3             5              6     7
  3    .92           .83        6   .95   .92
  5    [illegible]   .93        7   .93   .95

        5     6              5     7
  5    .93   .88        5   .93   .85
  6    .88   .95        7   .84   .95

Thus, the results obtained from Experiment II showed that for electrode
pair 3-5, whose average signals correlated with an r = .87, and pair 5-7, whose
r = .89, the EMG averages can be considered significantly different from one
another. In Experiment I, the r's for the averages from all 6 pairs of
electrodes ranged from .93 to .98, and all pairs were shown not to be
significantly different in the odd-even analysis. A thorough study of many
electrode pairs, several utterances and different muscles using the odd-even
analysis could possibly determine for which high values of r two EMG signals
can be considered the same.

Sources of Variation Removed by Averaging: Speaking Rate Variation

Although  the  averaging  of  EMG  signals  produces  a smooth  signal,  this 
technique  removes  not  only  the  noise  component  but  also  variation  in  the 
individual  EMG  signals  that  might  be  of  interest.  We  know  that  different 
repetitions  of  an  utterance  are  not  identical.  For  purposes  of  discussion, 
consider  that  there  are  two  sources  of  variation  eliminated  by  averaging,  that 
arising  from  normal  utterance  variation  and  that  arising  from  the  noise 
component,  {np(t)}.  The  purpose  of  the  following  studies  was  to  examine  the 
effects  of  normal  speaking  rate  variation  in  the  averaging  technique. 

EMG  signal  averages  are  calculated  after  all  the  repetitions  of  an 
utterance  are  lined  up  with  reference  to  a single  acoustic  event.  When 
speaking  rate  variation  occurs,  signal  peaks  will  become  more  unsynchronized 
the  further  away  they  are  from  the  chosen  acoustic  event.  A small  experiment 
was  conducted  to  examine  the  effects  of  speaking  rate  variation  on  the  peaks 
of  the  averaged  EMG  signal. 

The anomalous sentence from Experiment I, "Jean Teacup's nap is a snap,"
was separately processed through all of the programs using three different
line-up points: the first line-up point was the onset of voicing in "Jean,"
the second was the onset of voicing in "snap," and the third was the end of
voicing after "Jean." For all four levator palatini electrodes, the effects of
changing the line-up point were the same (see Figure 5 for signals from one
electrode). Peaks became lower and broader as they were further removed in
time from the line-up point. For example, seven peaks are clearly observable
in Figure 5A and 5C, but three of them merge into one in Figure 5B where the
line-up point is at the end of the sentence. The decrease in peak height was
not uniform when the line-up point was moved from one end to the other. The
average decrease was calculated as percent of (higher peak - lower
peak)/(lower peak) for two channels, LPR2 and LPL6. For the line-up point
extremes seen in Figures 5A and 5B, the average percent decrease was 13
percent for the first peak and 28 percent for the last peak. Of course, the
extent to which a peak is flattened will depend on its breadth in the original
utterances. A brief event will be more affected than one that lasts longer.
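
The flattening of peaks by timing variation can be reproduced with a toy simulation. The sketch below is an illustration only: it assumes each repetition contributes a unit-height Gaussian-shaped peak, and the timing offsets and peak widths are invented values, not measurements from the experiments. The percent decrease follows the definition given above.

```python
import math

def gaussian_peak(t_ms, center_ms, width_ms):
    """Unit-height Gaussian bump, a stand-in for one smoothed EMG peak."""
    return math.exp(-0.5 * ((t_ms - center_ms) / width_ms) ** 2)

def averaged_peak_height(offsets_ms, width_ms):
    """Height of the peak left after averaging bumps displaced in time.

    offsets_ms holds the (hypothetical) timing error of each repetition
    relative to the line-up point; the average is evaluated on a 1 msec grid.
    """
    n = len(offsets_ms)
    grid = range(-300, 301)
    averaged = [sum(gaussian_peak(t, off, width_ms) for off in offsets_ms) / n
                for t in grid]
    return max(averaged)

def percent_decrease(higher_peak, lower_peak):
    """Percent of (higher peak - lower peak)/(lower peak), as in the text."""
    return 100.0 * (higher_peak - lower_peak) / lower_peak

# Near the line-up point the repetitions are well aligned; far from it,
# speaking rate variation spreads them out (illustrative offsets).
near = averaged_peak_height([-10, 0, 10], width_ms=30)
far = averaged_peak_height([-40, 0, 40], width_ms=30)
loss = percent_decrease(near, far)
```

The same offsets applied to a narrower bump lower the averaged peak more, which matches the remark that a brief event is affected more than a longer one.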

The following qualitative generalizations may be drawn from these
results. The determination of the onset or offset of EMG activity cannot be
accurately made from the EMG average signals when the distance from the
line-up point is greater than about one second. A sampling window for
averaging of greater than 2 seconds is not advisable.

Since effects of speaking rate variation are observed in the average EMG
signal, we would like to know if the magnitude of these effects is so large
that timing information in the average EMG signal is seriously compromised.
This question can be partially answered by considering how closely the time
course of the individual EMG signals resembles that of the averaged EMG
signal. (Timing variations between individual EMG signals can be observed in
Figure 4.) Again the correlation coefficient, r, may be used to quantify this
resemblance because r is not sensitive to absolute amplitude.

The same procedure described in the section "Time Constant of Digital
Integration" was used. For the utterances and electrodes shown in Table 5,
the correlation coefficients for each individual EMG signal are calculated
with their mutual average EMG signal. From Experiment I, the correlation
coefficients were calculated for 10 tokens of /fimpip/ and 10 of the anomalous
sentence for two electrodes, 2 and 6. From Experiment II, results were taken
from Figure 2 for the time constant value 35 msec that was standard for this
experiment. The average correlation coefficient and the corresponding range
of correlation coefficients for each of 10 tokens were calculated for all 4
utterances appearing in Figure 2.


TABLE 5: Average Correlation Coefficients for EMG Signals from Single
Utterances Compared to Their Mutual Average EMG Signals and Their
Corresponding Ranges.

Experiment I, TC = 25 msec

               /fimpip/     /fimpip/     Sent.        Sent.
Electrode      LPR2         LPL6         LPR2         LPL6
Average r      .85          .88          .81          ...
Range r        .82 - .90    .84 - .91    .71 - .86    .77 - ...

Experiment II, TC = 35 msec

               Sent. A      Sent. B      Phrase C     Phrase D
Electrode      MYL3         MYL3         MYL3         MYL3
Average r      .77          .72          .72          ...
Range r        .66 - .85    .62 - .84    .58 - .84    .58 - ...


The closeness of the resemblance of a single EMG signal with its average
can be ascertained from Table 5. The correlations are somewhat higher for the
nonsense words than the sentences, presumably because the sentences are
longer. The square of the correlation coefficient, r², indicates the
proportion of the variance in the average EMG signal that can be predicted
from the individual EMG signals. The remaining variance, 1 - r², arises from
differences in the time course of the two signals, due to speaking rate
variations and the remains of the smoothed, uncorrelated noise component
npi(t). In this analysis, we cannot tell what percent of the uncorrelated
variance arises from the speaking rate variation, and what percent from the
noise component. Calculating r² from Table 5, for Experiment I, the
proportion of variance predicted ranges from .77 to .66, and for Experiment
II, from .69 to .50. Thus at least half of the variance in the average signal
is predictable from the individual EMG signals.

SUMMARY 

This  report  describes  a two-stage  EMG  signal  processing  system  developed 
for  speech  research.  In  the  first  stage,  EMG  signals  from  a number  of 
repetitions  of  the  same  utterance  are  amplified,  rectified  and  integrated.  In 
the  second  stage,  the  integrated  signals  are  carefully  aligned  in  time  and 
averaged, producing an average EMG signal. A model of the integrated EMG
signal  is  described  as  the  sum  of  a signal  representing  the  motor  unit 
potentials  specifying  muscle  contraction  and  a random  noise  signal.  This 
simple  model  is  used  in  several  correlation  analyses  of  the  integrated  and 
averaged  EMG  signals  to  determine  the  reliability  of  average  EMG  signals  and 
the  sources  of  variance  removed  by  averaging. 

Two  experiments  involving  multiple  electrode  placements  in  the  same 
muscle  provided  data  for  the  analyses.  Some  of  the  conclusions  discussed  may 
be  stated  as  follows: 

Because of their precision, linear-reset integrators having a small time
constant of 5 msec and used in conjunction with a digital integrator having a
variable triangular window were found to be preferable to RC integrators. The
digital integration specified had limited incremental value in increased
smoothing of the average EMG signals when the time constants were greater than
35 msec.

Average EMG signals sampled simultaneously from several electrodes placed
in a uniformly acting muscle were highly correlated, with an average r = .95.
It was found that speaking rate variation affected the time resolution of the
average EMG signal at time offsets greater than approximately 1 sec from the
line-up point. On the average, however, the individual EMG signals were found
to correlate highly with their average EMG signals, r's ranging from .72 to
.88. Thus, the averaging technique is successful in producing a reliable,
relatively noise-free and undistorted EMG signal, when a brief time window is
chosen.


Perceptual Test of a Phonological Rule
Linda  Shockey 


ABSTRACT 

Synthetic speech is characteristically produced in a highly
formal, maximally differentiated style. This experiment shows that
the output of one phonological rule manifested frequently in
conversational English (ð-assimilation) can be simulated easily by
lengthening appropriate word-final consonants. Listeners accept the
resulting long consonants as consonant + ð clusters. It is suggested
that the inclusion of this and other casual speech rules will
improve the naturalness and hence the "listenability" of synthetic
speech.

It has been noted by Wolfram and Fasold (1974), Gimson (1962), Kohmoto
(1965), Hubbell (1950), and Shockey (1974), among others, that in relaxed or
conversational speech a word-initial [ð] will assimilate completely to a
preceding nasal, fricative, or [l] and will cause the nasal, fricative, or [l]
to lengthen as well. Examples from natural speech follow (taken from Shockey,
1974):


S # Assimilations:

course they     [ ' kccts • ej. ]
effects the     [ifeks-a]

Z # Assimilations:

broads that     [ fcaadz* ae7 ]
cause the       [ kaz* a]
years there     [jnz'n]

N # Assimilations:

seen the        [ sin- a]
on their        [an* ea]
line that       [ laj,n*aet ]


The ð-assimilation will presumably only occur when the cluster in
question receives a relatively low degree of stress. Therefore, if one were to
construct a sentence of the sort "I said put it in this box, not in that
one!," the assimilation would be less likely to occur. However, since
virtually all English words beginning with [ð] carry very little semantic
load, ð-assimilation can be expected to be very frequent. This experiment was
conducted to investigate whether the lengthening of the assimilating
(word-final) consonant is sufficient to induce a cluster percept. If so,
ð-assimilation could easily be included in a speech synthesis strategy designed
to output natural-sounding English. The specific condition examined here is
ubiquitous in conversational speech: in the word "the", [ð] assimilates to a
preceding word-final consonant so that the distinction between definite and
indefinite articles is preserved mainly in the duration of the final
consonant. (It is also possible that another cue may be retained in the
consonant transitions. Final alveolar consonants may be fronted and thus be
more dental in cases where the interdental [ð] has been assimilated. This
possibility must be examined further.)

To test the perceptual effect of final consonant lengthening, the
following experiment was conducted: two utterances by one female speaker were
digitized and stored in a computer-accessible file. These utterances were
"miss a guy" ['missga*] and "warn a guy" [’^aanaga*]. Using digital splicing
techniques, the duration of the frication in the first utterance was varied
from 80 to 200 msec in 20 msec steps, and the duration of the low amplitude,
low frequency portion of the acoustic signal that presumably corresponded
articulatorily to the closure portion of [n] in the second was varied from 0 to
120 msec in 10 msec steps. The lengthening was done by holding the
transitions into and out of the steady-state consonant in question and
repeating a characteristic portion of the waveform judged to be the center of
the consonant enough times to give the desired durations. It was judged that
the nasal closure began when the waveform became smooth and lacked
high-frequency components. This decision lies behind the inclusion of an [n]
of 0 msec. The heavily nasalized transitions into and out of the closure were
sufficient to give an impression of the nasal consonant.
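
The splicing procedure can be sketched as follows. This is a hypothetical reconstruction in Python, not the software used for the experiment: the function name, the sampling rate and the sample values are all assumptions, and the waveform is represented simply as lists of samples for the onset transition, the repeated central portion, and the offset transition.

```python
def build_stimulus(onset, center_chunk, offset, duration_ms, sample_rate_hz):
    """Splice a stimulus whose steady-state portion lasts about duration_ms.

    The transitions are held fixed; the central chunk is repeated a whole
    number of times, so the achieved duration is the nearest multiple of the
    chunk length. A duration of 0 msec leaves only the two transitions.
    """
    target_samples = round(duration_ms * sample_rate_hz / 1000)
    repeats = round(target_samples / len(center_chunk))
    return onset + center_chunk * repeats + offset

# Durations described in the text.
fricative_durations_ms = list(range(80, 201, 20))  # 80 to 200 msec, 20 msec steps
nasal_durations_ms = list(range(0, 121, 10))       # 0 to 120 msec, 10 msec steps

# Hypothetical waveform pieces at an assumed 10 kHz sampling rate.
SAMPLE_RATE_HZ = 10_000
onset = [0.1] * 200      # transition into the consonant (20 msec)
chunk = [0.05] * 100     # characteristic central portion (10 msec)
offset = [0.1] * 200     # transition out of the consonant (20 msec)

nasal_stimuli = [build_stimulus(onset, chunk, offset, d, SAMPLE_RATE_HZ)
                 for d in nasal_durations_ms]
```

Repeating whole copies of the central chunk, rather than stretching it, keeps the spliced segment made only of samples that occurred in the original recording.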

Two tests were constructed. For each test, each stimulus was included
four times in a randomized-order listening test with three seconds between
stimuli. The resulting tests were presented over headphones to 30
undergraduate students at Ohio University. The students were asked to judge
whether the middle word in the three-word sequence was "a" [ə] or "the" [ðə].
They were instructed that the signal had been degraded and, therefore, that
their decision was to be based on which English article the stimulus reminded
them of the most.

Results are depicted graphically in Figure 1. At the top we see results
for the [s] test (miss a guy). It demonstrates that when the [s] assumes a
duration of 130 msec, subjects cease to hear the sequence as containing the
indefinite article and begin to hear it as containing the definite article.

Figure 1 (bottom) shows the same result at 120 msec for the lengthened
nasal segment. We have thus been able to induce the impression of an
s + ð cluster or n + ð cluster by increasing the length of the assimilating
consonant.

Figure 1: Number of "a" and "the" judgments as the nasal segment in "warn a
guy" is lengthened; number of "a" and "the" judgments as the
fricative segment in "miss a guy" is lengthened.

The curves shown in Figure 1 represent data points for all subjects who
responded systematically to the stimuli. Of 31 subjects, 5 responded randomly
to both tests. Five additional subjects displayed random results for part 2
(nasal + ð assimilation) while performing adequately on part 1. This means
that there were 24 subjects for part 1 and 21 for part 2. It is difficult to
say why many people did not respond in a patterned fashion to these stimuli;
perhaps the utterance is not long enough to induce a casual speech frame of
perception for some. It seems unlikely that the frequency with which one
hears these processes is a contributing factor, since both assimilations are
extremely common in the dialect areas in which the subjects live (Shockey,
1974).

Practical applications are foreseeable for the results of this
experiment:

1) Since [ð] is a difficult sound for learners of English to perfect, they
can be taught quite early on to assimilate it to appropriate preceding
consonants while also lengthening the consonant. This will not only make
their English easier to articulate, but closer to standard conversational
speech, since consonant + [ð] clusters are scarce in spoken English. Of
course, nonassimilable [ð] must still be dealt with. (Current
English-for-non-natives texts regard ð-assimilation as substandard. However, the
author has observed it to be ubiquitous in the speech of American
newscasters, actors, politicians, and others who depend on effective oral
communication.)

2) In speech synthesis strategies, consonant + [ð] clusters can be approximated
by lengthening the consonants that participate in this rule, which
could make the resulting speech phonologically more natural, less stilted,
and easier to listen to for extended periods (as is called for in reading
machines for the blind). It is likely that other phonological properties
of casual or connected speech are equally easy to simulate and should be
included in a synthesis strategy that hopes to approximate natural speech
output.


Adaptation of the Category Boundary Between Speech and Non-speech:
A Case Against Feature Detectors

Robert  E.  Remez^ 


ABSTRACT 


Two  experiments  were  performed  employing  acoustic  continua  that 
change  from  speech  to  nonspeech.  The  members  of  one  continuum, 
synthesized  on  the  Pattern  Playback,  varied  in  equal  steps  of  change 
in  the  bandwidths  of  the  first  three  formants,  from  the  vowel  /a/  to 
a nonspeech  buzz.  The  other  continuum,  achieved  through  digital 
synthesis,  varied  in  the  bandwidths  of  the  first  five  formants,  from 
/æ/ to buzz. The categorical perception of each continuum was
established  by  standard  procedures.  Perceptual  adaptation  on  these 
continua  then  revealed  effects  on  the  category  boundaries  comparable 
to  those  reported  for  speech  sounds.  The  results  are  interpreted  as 
suggesting  that  neither  phonetic  nor  auditory  feature  detectors  are 
responsible  for  perceptual  adaptation  of  speech  sounds. 

INTRODUCTION 


In their account of the perceptual process underlying phonetic identification,
Eimas and Corbit (1973) combined phonetic feature analysis and
hypercomplex cells (Hubel and Wiesel, 1965) to yield phonetic feature detectors,
understood as special cortical devices tuned to "listen" to the acoustic
stream and extract the phonetic building blocks. This claim seemed reasonable
on several counts. First, the elusiveness of the acoustic-phonetic correspondence
argued that a higher order evaluation would be needed to accomplish the
extraction of the meaning from the acoustic stream (Halle and Stevens, 1962;
Liberman, Cooper, Shankweiler, and Studdert-Kennedy, 1967). Second, a century
of efforts directed toward delineating the cortical loci of mental faculties
established that speech and language are mediated in a restricted cortical
area (Penfield and Roberts, 1959; Geschwind and Levitsky, 1968; Witelson and
Pallie, 1973). Third, the nativist position of the generative transformational
enterprise argued that rather detailed linguistic knowledge might be
prewired into every human infant (Chomsky, 1965; Chomsky, 1968). Fourth,
perceptual experiments on neonates seemed to show that infants were sensitive
to some phonetic distinctions before any relevant experience (Eimas, Siqueland,
Jusczyk, and Vigorito, 1971). And fifth, developments in electrophysiology
suggested that the sensitivities of single cortical cells are both
elaborate and correlated with the ecology of the animals studied (see Kuffler,
1973, for a review). The notion of genetically pretuned hypercomplex cells
that translate auditory events into phonetic descriptions neatly addressed
these issues.

Perceptual  adaptation  of  phonetic  category  boundaries  has  been  the 
technique  by  which  the  properties  of  the  hypothesized  detectors  have  been 
examined¹, but this body of research only marginally confirms the original
detector  description.  Although  certain  adaptation  effects  have  required  the 
explanation  to  include  a phonetic  level  of  analysis  at  which  particular 
acoustic-auditory  values  are  less  perceptually  significant  (Ades,  1974;  Diehl, 
1975; Miller, 1975; Remez, Cutting, and Studdert-Kennedy²), other research has
produced  evidence  of  non-phonetic  adaptation  that  is  fully  compatible  with  any 
of  the  previously  obtained  phonetic  effects  (Ades,  1974;  Pisoni  and  Tash, 
1975;  Bailey,  1975;  Diehl,  1976).  In  short,  the  present  situation  is 
paradoxical.  While  the  underlying  detectors  are  phonetic  by  the  original 
intention  as  well  as  by  occasional  necessity,  some  of  them  may  suffer  acoustic 
fatigue, and all suffer from inexplicit specification of their tuning curves.
[In tonotopically organized cells, the dimension for measuring sensitivity is
frequency  (Woolsey  and  Walzl,  1942);  in  phonetically  organized  detectors,  the 
dimensions  of  analysis  must  correspond  to  those  of  vocal  production,  and  many 
of  these  have  yet  to  be  defined.]  Additionally,  and  perhaps  fatally,  the 
passive  filtration  method  of  phonetic  feature  extraction  in  speech  assumes  a 
simple relation between the acoustic pattern, the phonetic segments, and the
ordinal  correspondence  between  them.  But  it  is  the  fundamental  point  of  many 
perceptual  studies  that  segmental  identity  typically  is  carried  by  the  sound 
pattern  distributed  across  the  entire  syllable  (for  example,  Cooper,  Delattre, 
Liberman,  Borst,  and  Gerstman,  1952).  This  requires,  in  essence,  that  each 
feature  detector  be  a little  homunculus,  omniscient  on  the  nature  of  the 
context-conditioned variation of its favorite feature of speech. For these
reasons, either the task requires vast multiplexing, well beyond the sensible
(Halwes and Jenkins, 1971), or the device must give up its appealing
simplicity in order to deal with the complexity of the message structure of the
speech chain.


¹Haggard (1967) first described the paradigm of after-effect research employing
speech sounds, but his rationale was completely independent of
neurophysiological claims.
A test of the hypothesis that adaptation effects reveal detectors tuned
to  linguistic  features  might  attempt  to  produce  adaptation  outside  the  set  of 
effects  predictable  from  the  feature  inventory.  The  experiments  reported 
here,  which  use  acoustic  continua  from  speech  to  nonspeech  sounds,  satisfy 
this  condition.  Further,  if  acoustic-auditory  explanations  can  be  ruled  out, 
there  would  then  be  reason  to  suggest  an  active,  cognitive  basis  for  this 
effect,  one  which,  if  applied  to  speech,  would  be  compatible  with  a phonetic 
level  of  analysis,  but  incompatible  with  phonetic  feature  detectors. 

EXPERIMENT Ia


Methods 


Subjects. Sixteen University of Connecticut undergraduates, whose participation
fulfilled the introductory psychology course requirement, served as
listeners  in  this  part  of  the  study.  All  were  native  English  speakers  with  no 
known  speech  or  hearing  disorder  or  psychopathology.  None  had  any  experience 
with  synthetic  speech  sounds  before  the  listening  session. 

Stimuli.  The  Haskins  Laboratories  Pattern  Playback  (Cooper,  Liberman, 
and Borst, 1951) was used to synthesize the basic materials³. This device
uses a tone wheel to generate the harmonics of 120 Hz in light intensities
arrayed  in  a frequency  scale.  A graphic  pattern  selectively  reflects  portions 
of  this  scale,  and  this  reflection,  through  capture  by  a photocell,  is 
transduced  to  a frequency  by  amplitude  by  time  acoustic  signal.  Figure  1 
displays  the  pattern  painted  (Liquitex  Acrylic  Titanium  White  Grumbacher  #4) 
on  the  acetate  belt  (Eastman  Kodak)  and  the  frequency  values  of  the  transduced 
signal.  The  pattern  changes  from  a vowel  /a/,  with  formant  frequency  values 
of  600  Hz,  1200  Hz,  and  2400  Hz,  to  a nonspeech  buzz,  by  modifying  the 
bandwidths  of  the  formants;  initially,  the  bandwidths  are  100  Hz,  and  they 
increase  to  effectively  infinite  width  at  the  end  of  the  pattern.  Figure  2a 
presents  one  spectral  section  through  each  of  the  endpoints. 

This  1-second  sound  was  transferred  to  audiotape  and  then  digitized  by 
the  Haskins  Pulse-Code  Modulator  (PCM)  (Cooper  and  Mattingly,  1969),  sampling 
at 10 kHz with low-pass filtering at 5 kHz. A ten-step continuum was then
made by editing the digitized waveform. Nine cuts were made, one every 100
milliseconds; the oscillographic patterns were equated for amplitude, producing
10 tokens, each of 12 pitch periods which vary in formant bandwidth as
does  the  overall  pattern. 
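
The editing step just described can be sketched in a few lines. What follows is a minimal illustration, not the Haskins PCM editing software: the function names are hypothetical, the test signal is a made-up stand-in for the digitized pattern, and equating by RMS level is an assumption, since the text says only that the patterns were equated for amplitude.

```python
import math

SAMPLE_RATE_HZ = 10_000   # sampling rate given in the text
TOKEN_MS = 100            # one cut every 100 milliseconds
F0_HZ = 120               # Playback fundamental given in the text


def cut_into_tokens(samples, rate=SAMPLE_RATE_HZ, token_ms=TOKEN_MS):
    """Split a digitized sound into consecutive, equal-length tokens."""
    step = rate * token_ms // 1000
    return [samples[i:i + step] for i in range(0, len(samples) - step + 1, step)]


def rms(token):
    """Root-mean-square level of one token."""
    return math.sqrt(sum(x * x for x in token) / len(token))


def equate_amplitude(tokens, target_rms=1.0):
    """Rescale every token to the same RMS level (assumed criterion)."""
    return [[x * target_rms / rms(tok) for x in tok] for tok in tokens]


def pitch_periods_per_token(token_ms=TOKEN_MS, f0_hz=F0_HZ):
    """Number of pitch periods that fit in one token."""
    return token_ms * f0_hz / 1000


# Hypothetical stand-in for the 1-second digitized pattern.
sound = [(1 + n / SAMPLE_RATE_HZ) * math.sin(2 * math.pi * F0_HZ * n / SAMPLE_RATE_HZ)
         for n in range(SAMPLE_RATE_HZ)]
tokens = equate_amplitude(cut_into_tokens(sound))
```

With a 1-second sound at 10 kHz, nine cuts at 100-msec intervals give ten tokens of 1000 samples, and at a 120 Hz fundamental each token spans 12 pitch periods, as stated above.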

³Neither the Haskins Parallel Resonance Synthesizer nor the Ove III was
suitable for this study because of hardware-imposed limits on formant
bandwidth. These devices are devoted speech synthesizers, and this study
required a full-frequency synthesizer, that is, one with no such restriction.

Two test tapes were created using the PCM system. An identification
sequence contained 10 occurrences of each of the 10 continuum items, for a
test of 100 trials, with 5 seconds between trials, and 9 seconds following
each decade. A discrimination sequence consisted of ABX triads with 1 second
between items, 5 seconds between trials, and 9 seconds separating the decades.
The 4 permutations of each ABX comparison were represented: ABA, ABB, BAB,
and BAA. In a one-step discrimination, the comparisons are items 1 and 2, 2
and 3, 3 and 4, and so on; at 4 trials per comparison, and 9 comparisons,
there were 36 trials in this test.
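
The construction of the discrimination sequence can be made concrete with a short sketch. The names below are hypothetical, and the seeded shuffle is an assumption standing in for whatever randomization was actually used.

```python
import random

ABX_PATTERNS = ("ABA", "ABB", "BAB", "BAA")  # the four permutations named in the text


def one_step_pairs(n_items=10):
    """Adjacent continuum items: (1, 2), (2, 3), ..., (n - 1, n)."""
    return [(i, i + 1) for i in range(1, n_items)]


def abx_trials(n_items=10):
    """Every one-step comparison in each of the four ABX permutations."""
    trials = []
    for a, b in one_step_pairs(n_items):
        lookup = {"A": a, "B": b}
        for pattern in ABX_PATTERNS:
            trials.append(tuple(lookup[ch] for ch in pattern))
    return trials


def randomized(trials, seed=0):
    """Return a shuffled copy; the seed is arbitrary (assumption)."""
    order = list(trials)
    random.Random(seed).shuffle(order)
    return order


sequence = randomized(abx_trials())
```

Each trial is a triple of item numbers in playing order, so the third sound always matches either the first or the second.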

Procedure  and  Apparatus 

The  sixteen  listeners  were  tested  in  four  groups  of  four.  Sounds  were 
presented  binaurally  over  Grason-Stadler  earphones  activated  by  a Crown  820- 
144  tape  recorder  through  a junction  box  so  that  several  listeners  could 
listen  simultaneously.  Each  session  commenced  with  a briefing  sequence  in 
which  the  endpoints  of  the  continuum  were  each  repeated  ten  times,  in 
alternation.  At  that  time,  listeners  were  asked  to  signify  if  they  had  a good 
idea  of  the  sounds'  identity;  their  instructions  were  to  consider  the  buzz  a 
machine  noise,  and  the  vowel  a synthetic  speech  sound.  The  identification 
test  was  then  administered.  Identifications  were  scored  on  a response  sheet 
as speech or buzz (S or B). After a short intermission, listeners were given
sample  ABX  sequences  in  which  they  judged  which  of  the  first  two  sounds  was 
identical  to  the  third;  the  continuum  endpoints  were  used  here  to  insure 
clarity  of  the  instructions.  The  actual  test,  begun  when  all  agreed  that  they 
understood  the  instructions,  consisted  of  the  36-trial  discrimination  sequence 
played  twice. 

Results  and  Discussion 

Three subjects were dropped because they either failed to follow instructions
(two subjects declined to judge difficult items) or responded at chance
on the identifications (one subject). Results for identification and
discrimination appear in Figure 3a. Each point is the mean of 130 observations in
the identification test, and 108 observations in the discrimination test.
These functions are reasonably consistent with the criteria for categorical
perception (Studdert-Kennedy, Liberman, Harris, and Cooper, 1970), in that a
peak in discriminability occurs at the breakpoint between the identification
categories.  The  term  "categorical  perception"  describes  a situation  in  which 
the  judged  difference  between  two  entities  is  contingent  on  their  identities 
rather  than  on  the  physical  differences  between  them. 
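
One common way to make this criterion concrete is the prediction used in the categorical perception literature: if listeners can discriminate only what they label differently, expected ABX accuracy for two items is 0.5 + 0.5(pA - pB)^2, where pA and pB are the proportions of "speech" identifications for the two items. The sketch below applies that formula to made-up identification proportions; it is an illustration under that assumption, not an analysis reported in this paper.

```python
def predicted_abx(p_a, p_b):
    """Expected proportion correct in ABX if discrimination rests on labels alone."""
    return 0.5 + 0.5 * (p_a - p_b) ** 2


def predicted_function(identification):
    """Predicted one-step discrimination for each adjacent pair of items."""
    return [predicted_abx(a, b) for a, b in zip(identification, identification[1:])]


# Hypothetical proportions of "speech" responses for items 1 through 10.
identification = [1.0, 1.0, 0.98, 0.95, 0.8, 0.3, 0.1, 0.02, 0.0, 0.0]
predicted = predicted_function(identification)
peak_pair = predicted.index(max(predicted)) + 1  # 1-based index of the first item in the pair
```

Under this prediction there is a single peak, at the pair that straddles the identification boundary, and chance performance (0.5) wherever both items receive the same label.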

One  anomaly  in  the  discrimination  function  should  be  addressed,  namely, 
the  troughs  of  the  function.  Given  two  categories,  speech  and  buzz,  the  peaks 
should number one, not three. In the function of Figure 3, however, the
discrimination peaks between items 1 and 2, and between 9 and 10, are as
prominent  as  the  category  boundary  peak.  Examination  of  the  waveforms  of 
these  tokens  revealed  amplitude  differences  between  2 of  the  12  pitch  periods 
in  each  item  of  the  pairs.  No  such  difference  could  be  discovered  between 
items  in  the  other  pairs  in  the  set.  One  possible  account,  then,  of  the 
spurious  peaks  is  that  they  result  from  judgments  of  amplitude  rather  than 
spectrum  differences.  If  this  reasoning  is  correct,  the  peaks  can  be 
discounted  in  any  challenge  to  categoricity,  because  the  manipulation  of 
interest  is  spectral,  and  this  particular  discrimination  is  made  on  a 
nonspectral  basis. 


Figure 3a: Identification and discrimination plots for Playback.


EXPERIMENT Ib


Methods 


Subjects.  Eight  University  of  Connecticut  undergraduates  were  paid  to 
listen in this part of the study. They had all participated in Experiment Ia
(five  from  the  original  group  could  not  attend  the  listening  sessions  for 
scheduling  reasons). 

Stimuli. The 10 tokens from Experiment Ia were used. An adaptation
sequence consisted of an initial 100 repetitions of the adapting item, one of
the continuum endpoints, at 1-second intervals. After a 10-second pause,
which cued the listeners that the identification trials were coming up, six
items from the continuum were presented for identification, as either speech
or buzz (S or B). At the conclusion of the block of six, there was a
10-second pause followed by 50 repetitions of the adapting item, another block of
six, and so on for the remainder of the test.

Each  of  the  10  sounds  drawn  from  the  continuum  was  presented  for 
identification  twelve  times,  with  the  exception  of  the  four  most  extreme,  the 
two on each end that were presented six times each. This preserves sensitivity
in the midrange of the continuum and shortens the test by two blocks of
identifications, to the relief of the listener. With 96 trials (6 twelves and
4 sixes) there were 16 blocks of six trials each. The random order for these
items  was  the  same  in  both  the  speech  and  the  buzz  adaptation  sequences. 
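
The trial counts follow from simple arithmetic, sketched below with a hypothetical helper. The repetition counts and block size are those given in the text; 96 trials in blocks of six come to 16 blocks. The total number of adaptor presentations is derived here on the assumption that every block after the first is preceded by exactly 50 repetitions.

```python
def adaptation_schedule(n_items, n_extreme=4, extreme_reps=6, medial_reps=12,
                        block_size=6, first_adaptor_reps=100, later_adaptor_reps=50):
    """Count identification trials, blocks, and adaptor presentations."""
    n_medial = n_items - n_extreme
    trials = n_extreme * extreme_reps + n_medial * medial_reps
    blocks, leftover = divmod(trials, block_size)
    adaptor = first_adaptor_reps + later_adaptor_reps * (blocks - 1)
    return {"trials": trials, "blocks": blocks,
            "leftover": leftover, "adaptor_presentations": adaptor}


ten_step = adaptation_schedule(10)   # the Playback continuum
nine_step = adaptation_schedule(9)   # a nine-item continuum, for comparison
```

The same helper gives 84 trials in 14 blocks for a nine-item continuum with four extreme and five medial items.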

Procedure. An identification sequence was used to determine a standard
identification  function  in  each  of  the  two  sessions.  This  was  used  for 
comparison  with  the  adapted  identification  function.  All  subjects  took  part 
in  both  conditions;  half  took  the  /a/  test  first;  half  took  the  buzz  test 
first.  Several  days  separated  the  test  sessions.  The  equipment  and  test 
conditions were in all other respects the same as in Experiment Ia.

Results 


Each  subject  contributed  two  sets  of  judgments  per  session,  a pretest  set 
and  an  adaptation  set.  To  each  of  these  a standard  ogive  was  fitted,  after 
Woodworth  (1938).  Thus,  two  scores  were  available  for  each  subject  per  test, 
one  mean  of  the  fitted  ogive,  measured  in  continuum  units,  for  the  pretest, 
and  one  for  the  adaptation  test. 
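
For readers unfamiliar with ogive fitting, the sketch below shows one simple way to obtain the mean of a cumulative normal fitted to identification data: convert each proportion to a z-score and fit a straight line by least squares. This is an illustrative stand-in and not necessarily the procedure described by Woodworth (1938); the function name, the clamping rule for proportions of 0 or 1, and the example data are all assumptions.

```python
from statistics import NormalDist


def fit_ogive(items, prop_speech, n_trials):
    """Least-squares line through z-transformed proportions.

    Returns (mean, sd) in continuum units; the mean is the item value
    at which the fitted proportion of "speech" responses is 0.5.
    """
    std_normal = NormalDist()
    floor = 0.5 / n_trials  # keeps inv_cdf away from 0 and 1 (assumed rule)
    zs = [std_normal.inv_cdf(min(max(p, floor), 1.0 - floor)) for p in prop_speech]

    n = len(items)
    mean_x = sum(items) / n
    mean_z = sum(zs) / n
    sxx = sum((x - mean_x) ** 2 for x in items)
    sxz = sum((x - mean_x) * (z - mean_z) for x, z in zip(items, zs))

    slope = sxz / sxx
    intercept = mean_z - slope * mean_x
    return -intercept / slope, 1.0 / abs(slope)


# Made-up data generated from a known ogive (mean 5.5, sd 1.2) as a self-check.
truth = NormalDist(mu=5.5, sigma=1.2)
items = list(range(1, 11))
prop_speech = [1.0 - truth.cdf(x) for x in items]
mean, sd = fit_ogive(items, prop_speech, n_trials=10**6)
```

A boundary shift is then the difference between the fitted mean for the adaptation test and the fitted mean for the pretest, in continuum units.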

The  curves  for  the  grouped  data  for  each  session  appear  in  Figure  4. 
Each  pretest  plot  represents  the  means  of  80  trials  per  continuum  item;  in  the 
adaptation plot, the two extreme points on either end, items 1, 2, 9, and 10,
are  the  means  of  48  observations  each;  the  remaining  six  medial  points  are  the 
means  of  96  judgments  each.  The  change  in  the  ogive  mean  due  to  speech 
adaptation  was  1.32  continuum  units,  and  that  due  to  buzz  was  .638  continuum 
units  in  the  opposite  direction. 

Figure 4: Pretest and adaptation test curves for the Playback continuum.

A two-factor repeated measures analysis of variance was performed on the
ogive means, with two levels of adaptor (Speech or Buzz) and two conditions
(Pretest or Adaptation). There were no significant main effects, indicating
that the means, when collapsed across conditions or across adaptor, did not
depart from the grand mean. However, the interaction of adaptor by condition
was significant [F(1,7) = 12.771, p < .01], reflecting the different, and
opposite, effect of each adaptor relative to the pretest mean, on the
adaptation test mean.
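
In a two-by-two design in which every subject contributes all four scores, the interaction test reduces to a one-sample test on a single contrast per subject: the shift under one adaptor minus the shift under the other. The F ratio, with 1 and n - 1 degrees of freedom, is the square of the corresponding t statistic. The sketch below illustrates this with made-up ogive means for eight subjects; the values are hypothetical and are not the data of this experiment.

```python
from statistics import mean, stdev


def interaction_f(pre_speech, post_speech, pre_buzz, post_buzz):
    """F for the adaptor-by-condition interaction in a 2 x 2 within-subject design."""
    contrast = [(ps - rs) - (pb - rb)
                for rs, ps, rb, pb in zip(pre_speech, post_speech, pre_buzz, post_buzz)]
    n = len(contrast)
    t = mean(contrast) / (stdev(contrast) / n ** 0.5)
    return t * t, (1, n - 1)


# Hypothetical ogive means (continuum units) for eight subjects.
pre_speech = [5.0] * 8
post_speech = [5.5, 6.0, 6.5, 7.0, 5.5, 6.0, 6.5, 7.0]
pre_buzz = [5.0] * 8
post_buzz = [4.5, 4.0, 3.5, 3.0, 4.5, 4.0, 3.5, 3.0]

f_value, dof = interaction_f(pre_speech, post_speech, pre_buzz, post_buzz)
```

With eight subjects the degrees of freedom are (1, 7), matching the form of the statistic reported above.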

Discussion 


In  this  experiment  the  shifts  of  the  category  boundary  do  damage  to  the 
proposal that the units mediating the effect are isomorphic with the primitives
of phonetic feature analysis. Miller (1975), for example, has proposed
that  feature  analyzing  devices  are  arranged  so  that  the  fatigue  of  one  leads 
to  the  relatively  enhanced  strength  of  its  inverse  or  opponent.  However,  in 
the  case  of  /a/  fatigue,  the  resulting  change  requires  going  outside  the 
feature  set  used  in  phonetic  descriptions  to  capture  the  distinction  between 
speech  and  nonspeech.  In  other  words,  a processing  device  arranged  along  the 
lines  of  phonetic  features  could  neither  predict  nor  explain  this  case  of 
adaptation⁴.

Must  we  then  have  recourse  to  a "lower  level"  account  of  the  results? 
Other authors have found higher order descriptions unwarranted by the adaptation
effects (Pisoni and Tash, 1975; Bailey, 1975; Ades⁵), and since higher
order phenomena can be described in lower order physical terms, they have
ascribed  their  effects  to  alterations  in  sensitivity  at  a lower  level  within 
the  auditory  system.  For  example,  a receptive  unit  with  a "best"  stimulus  may 
tend  to  show,  under  fatigue,  a decreased  sensitivity  to  that  stimulus;  its 
decrease  leads  to  the  release  from  inhibition  of  adjacent,  similar  receptors, 
and  consequently,  to  the  relative  enhancement  of  near  misses  from  the  best 
value. Thus, the auditory system, early on in the course of an analysis, can
yield a mistransformed description which the unsuspecting analyzers in the
next step are helpless to reverse. [The actual neurophysiology of this is
still open to question. One current topic of investigation is whether the
reciprocal inhibition demonstrable at the VIII nerve nucleus arises cochlearly
or in the nucleus itself (Mountcastle, 1974).] By this mechanism, then, a unit
or units sensitive to a portion of the frequency range, when fatigued, will be
less sensitive to the absolute values of stimulation, and will, via
disinhibition, effectively amplify departures from the original fatigued values. For
example, a receptor that mediates a rising second formant value over the
course of 35 msec (which specifies a voiced bilabial in some circumstances)
will be less sensitive when fatigued to values that exactly conform to the
pattern of fatigue. Disinhibition of neighboring receptors would, in effect,
boost receptor responses to second formant transitions that depart by small
amounts from the fatigued values. This will be costly in terms of the
perceptual outcome only in borderline cases, that is, those in which the
fatigue-disinhibition throws the pattern over the line from one category into
another. This explanation insists that the auditory transcription which the
phonetic system is given to work with has been irretrievably altered.

⁴Morse, Kass, and Turkienicz (1976) found that an /ɛ/ or /i/ adaptor, but not
/ɪ/, changed both boundaries on an /i/-/ɪ/-/ɛ/ continuum; they concluded that
the binary distinctions "tense" and "high" of Chomsky and Halle (1968), which
would specifically rule out such a finding, had been empirically falsified.
They offered that, because their results had shown continuity rather than
discreteness, the feature system underlying the result was necessarily
continuous, many-valued rather than binary; by extrapolation, so was the
detector. Their approach was like that of Cooper and Blumstein (1974) who
were the first to use adaptation to find perceptual interactions and thereby
to define the phonetic features perceptually, rather than acoustically or
articulatorily. Nevertheless, because no language makes phonetic use of the
distinction [+speech, -speech], it is safe to say that /a/-buzz adaptation
does not require that we posit a new feature; rather, it undermines the
interpretation of speech adaptation effects in terms of phonetic features.

⁵Ades, A. E. Source assignment and feature extraction in speech
(unpublished manuscript).

However,  if  this  reasoning  is  applied  to  the  adaptation  by  /a/  and  buzz, 
a curious situation arises. Fatigue caused by buzz should decrease sensitivity
throughout the range of frequencies used; no frequency-specific effects
should  occur.  Indeed,  the  auditory  view  of  adaptation  would  predict  no 
adaptation  at  all. 

On  the  other  hand,  fatigue  caused  by  /a/  should  eat  holes  in  the  "neural 
spectrogram" of the buzz, decreasing sensitivity at 600 Hz, 1200 Hz, and 2400
Hz. If a listener were then presented with the buzz, he should hear a pattern
the inverse of /a/, with formants at 300 Hz, 900 Hz, and 1800 Hz⁶. In this
case,  if  the  listener  judges  the  sound  on  the  basis  of  the  acoustic  feature  of 
presence  or  absence  of  formant  structure,  then  the  auditory  point  of  view 
predicts  that  the  boundary  should  move  toward  the  buzz,  since  the  fatigued 
spectral  receptors,  even  in  this  extreme  case,  might  be  expected  to  retain  a 
pattern  showing  acoustic  maxima  and  minima.  However,  precisely  the  reverse 
boundary  movement  was  actually  observed. 
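
One way to read the "inverse" formant values given above is as the midpoints of the bands left between the fatigued frequencies, taking 0 Hz as the lower edge of the first band. That reading is an interpretation offered here for illustration, not something the text states, and the function name is hypothetical.

```python
def inverse_peaks(fatigued_hz, lower_edge_hz=0):
    """Midpoints of the bands lying between successive fatigued frequencies."""
    edges = [lower_edge_hz] + sorted(fatigued_hz)
    return [(low + high) / 2 for low, high in zip(edges, edges[1:])]


# Formant frequencies of the Playback /a/ given in the text.
peaks = inverse_peaks([600, 1200, 2400])
```

On this reading the three fatigued formants of /a/ leave maxima at 300, 900, and 1800 Hz.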

The  present  experiment,  therefore,  produces  a situation  unique  in  the 
adaptation  literature.  While  the  conventional  approach  has  been  to  suspect 
auditory  processes  by  default  whenever  a phonetic  account  fails,  this  is 
obviously  not  possible  here.  The  listener  must  be  judging  the  sounds  on  other 
than  a simple  acoustic  basis.  The  perception  of  the  novel  distinction  between 
/a/ and buzz required here indicates a stable perceptual capacity that can be
reconfigured to suit the demands of the particular situation. Neither
phonetic feature nor acoustic property detectors can be reconciled with this
type of perceptual flexibility.


⁶This pattern, when rendered by the Playback, does not sound like a speech
sound.
Based on the foregoing, there are several motives for extending this
investigation.  First,  the  anomalous  discrimination  peaks  mildly  threaten  the 
claim  of  categoricity  for  this  perceptual  distinction.  By  implication,  the 
explanation  that  the  adaptation  effect  is  judgmental  rather  than  perceptual, 
cannot  be  confidently  rejected.  Second,  the  artificiality  of  the  speech 
synthesized  on  the  Playback  may  be  a factor  to  consider.  Because  the 
Playback,  however  phonetically  identifiable  its  message  may  be,  has  a voice 
quality unlike that of any person, its use in an experiment of this kind may
produce  synthesizer  artifacts.  As  a precaution,  then,  it  would  be  valuable  to 
try  this  procedure  with  a more  natural  sounding  synthesizer.  Third,  the 
possibility that this effect is restricted to /a/, that /a/ might intrinsically
be more nonspeechlike than other vowels, could be assessed by using a
different vowel in the same paradigm. On these accounts, Experiment II was
performed.


EXPERIMENT IIa


Methods


Subjects. Eight University of Connecticut undergraduates, not those of
Experiment  I,  were  paid  for  their  participation.  All  were  naive  with  respect 
to  synthetic  speech. 

Stimuli. The software synthesizer of Fisher and Engebretson (1975),
modified to permit variable parameterization of all five formants, was used to
make the acoustic tokens. This program calculates a digital wave from
user-determined parameters of source frequency, formant frequency and bandwidth,
and overall duration and amplitude. The digital wave is then converted to
audio via a PCM system. These programs, implemented by Joe Kupin and Hal
Tzeutschler, run on the University of Connecticut Language and Psychology Data
General NOVA 2.

A nine-step continuum from /æ/ to buzz was made by successive 50 Hz
increments in the formant bandwidths, starting from an initial bandwidth of
100 Hz for each formant. Duration was 140 msec; overall amplitude was 45 dB;
fundamental frequency was 120 Hz; formant frequencies for the vowel were
F1: 750, F2: 1550, F3: 2460, F4: 3500, and F5: 4500. The audio output was
transferred to the Haskins PCM via Ampex tape recording, to permit algorithmic
envelope shaping. Each item was sixteen pitch periods (140 msec) long, with
ramp on and off of 3 periods (25 msec); overall amplitudes were equated.
Spectral sections through the endpoints appear in Figure 2b.
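
To show how widening formant bandwidths flattens the spectrum toward a buzz, the sketch below passes an impulse train through a cascade of second-order digital resonators, one per formant. It is not the Fisher and Engebretson program: the cascade arrangement, the impulse-train source, the 10 kHz sampling rate, and all function names are assumptions made for illustration, and envelope shaping and amplitude equation are omitted.

```python
import math

SAMPLE_RATE_HZ = 10_000                     # assumed; not stated for this synthesizer
F0_HZ = 120                                 # fundamental frequency from the text
FORMANTS_HZ = [750, 1550, 2460, 3500, 4500]  # formant frequencies from the text
DURATION_MS = 140                           # item duration from the text


def continuum_bandwidths(start_hz=100, step_hz=50, n_items=9):
    """Bandwidth shared by all formants at each step of the continuum."""
    return [start_hz + step_hz * i for i in range(n_items)]


def resonator(signal, freq_hz, bandwidth_hz, rate=SAMPLE_RATE_HZ):
    """Second-order digital resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    period = 1.0 / rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * period)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * period) * math.cos(2.0 * math.pi * freq_hz * period)
    a = 1.0 - b - c
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def pulse_train(f0_hz=F0_HZ, duration_ms=DURATION_MS, rate=SAMPLE_RATE_HZ):
    """Unit impulses spaced one pitch period apart."""
    n = rate * duration_ms // 1000
    signal = [0.0] * n
    k = 0
    while True:
        index = round(k * rate / f0_hz)
        if index >= n:
            break
        signal[index] = 1.0
        k += 1
    return signal


def synthesize(bandwidth_hz):
    """One continuum item: the source filtered by all five formant resonators."""
    wave = pulse_train()
    for freq_hz in FORMANTS_HZ:
        wave = resonator(wave, freq_hz, bandwidth_hz)
    return wave


continuum = [synthesize(bw) for bw in continuum_bandwidths()]
```

Nine steps of 50 Hz starting from 100 Hz give bandwidths from 100 to 500 Hz. As the bandwidth grows, each resonator rings for a shorter time, so the formant peaks become broader and less prominent.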

The  identification  test  consisted  of  ten  judgments  of  each  of  the  nine 
items  in  random  order.  The  discrimination  test  consisted  of  eight  judgments 
of  each  of  the  eight  one-step  comparisons,  in  random  order. 

Procedure and apparatus. The outline of Experiment Ia was followed.

Results and Discussion

Figure  3b  displays  the  functions  for  identification  and  discrimination. 
The  identification  plot  displays  the  means  of  80  trials  per  point,  the 
discrimination plot the means of 64 trials per point. Inspection of the
figure will reveal that, relative to Experiment Ia, the peak at the category
boundary  remains  a property  of  the  discrimination  function,  while  the  peaks  at 
the  extremes  of  the  continuum  have  disappeared  at  the  buzz  end,  and  all  but 
disappeared  at  the  speech  end.  It  is  reasonable  to  conclude  that  this  more 
carefully  controlled  continuum  elicited  true  categorical  perception. 

EXPERIMENT IIb

Methods 

Subjects. The eight listeners from IIa were paid for their participation
in  this  section  of  the  study. 

Stimuli. The nine-item continuum from IIa was used to make the adaptation
sequences. These tests differed from Ib only in the consequences of
using a nine- as opposed to a ten-step continuum. Here, the four most extreme
items were presented six times each for identification during adaptation, and
the remaining five medial items twelve times each. With 84 trials overall (4
sixes and 5 twelves), there were fourteen blocks of six trials each, which
alternated with the repeating adaptation item, either /æ/ or buzz. The random
order  of  identifications  during  adaptation  was  the  same  in  the  speech  and  buzz 
adaptor  conditions. 

Procedure.  As  in  the  previous  procedure,  each  test  day  began  with  the 
identification  sequence  in  order  to  obtain  a standard  for  comparison  with  the 
adapted  identification;  test  days  were  consecutive.  All  subjects  took  part  in 
both  adaptor  conditions  that  were  counterbalanced  for  order. 

Results 

The  ogive  fitting  method  was  again  used  on  the  two  tests  per  day 
contributed  by  each  subject. 

Averaged  functions  for  both  adaptation  conditions  appear  in  Figure  5. 
Pretest  plots  show  the  means  of  80  trials  per  continuum  item;  adaptation  plots 
show  the  means  of  48  trials  for  items  1,  2,  8,  and  9 and  96  trials  for  3 
through  7.  The  change  in  the  ogive  mean  due  to  speech  adaptation  is  .883 
continuum  units,  due  to  buzz  adaptation  .692  in  the  opposite  direction. 

An  analysis  of  variance  was  performed  on  the  ogive  means,  with  two  levels 
of  each  factor,  adaptor  (Speech/Buzz)  and  mean  (Pre/Post).  The  interaction  is 
the term of interest here, with F(1,7) = 32.842, p < .001. The main effect of
adaptor was also significant, with F(1,7) = 65.858, p < .001. The statistical
significance  of  the  adaptor  term  was  due  to  the  close  correspondence  of  the 
two  pretest  means,  which,  when  averaged  with  the  adaptation  means,  clearly 
reflect  the  differential  effects  of  adaptation.  (Experiment  Ib  showed  no  such 
significance  for  this  term  because  the  pretest  means  varied  in  opposition  to 
the  adaptation  means,  thus  cancelling  the  effect  of  adaptor  upon  averaging.) 
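
The  logic  of  the  two  tests  can  be  sketched  as  follows.  In  a  design  with 
two  within-subject  factors  of  two  levels  each,  an  effect  with  one  degree  of 
freedom,  tested  against  its  own  subject-by-effect  error  term,  is  equivalent 
to  a  squared  one-sample  t  on  a  per-subject  contrast  score,  with  1  and  n - 1 
degrees  of  freedom  (1  and  7  for  eight  subjects).  That  this  was  the  error 
term  used  here  is  an  assumption,  and  the  ogive  means  listed  below  are 
hypothetical  values  chosen  only  to  show  the  computation. 

```python
def contrast_f(scores):
    """F ratio and degrees of freedom for a single-df within-subject contrast.

    Equivalent to the square of a one-sample t test of the contrast
    scores against zero.
    """
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean ** 2 / (var / n), (1, n - 1)


# HYPOTHETICAL ogive means for eight subjects, not data from this study:
# (speech_pre, speech_post, buzz_pre, buzz_post)
subjects = [
    (5.0, 5.9, 5.1, 4.4),
    (4.8, 5.6, 4.9, 4.3),
    (5.2, 6.2, 5.1, 4.5),
    (5.1, 5.8, 5.0, 4.2),
    (4.9, 6.0, 5.0, 4.6),
    (5.0, 5.7, 4.8, 4.0),
    (5.3, 6.1, 5.2, 4.7),
    (4.7, 5.5, 4.9, 4.1),
]

# Interaction: does the pre-to-post change differ between the adaptors?
interaction = [(sp_post - sp_pre) - (bz_post - bz_pre)
               for sp_pre, sp_post, bz_pre, bz_post in subjects]

# Adaptor main effect: pretest and adapted means averaged per adaptor.
adaptor = [(sp_pre + sp_post) / 2 - (bz_pre + bz_post) / 2
           for sp_pre, sp_post, bz_pre, bz_post in subjects]

f_interaction, df_interaction = contrast_f(interaction)
f_adaptor, df_adaptor = contrast_f(adaptor)
```

When  the  two  pretest  means  nearly  coincide,  the  adaptor  contrast  is 
carried  almost  entirely  by  the  adapted  means,  which  is  why  the  adaptor  term 
can  reach  significance  alongside  the  interaction. 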

Discussion 


This  study  with  software  synthesized  sounds  strengthens  the  original 
argument  made  from  Playback  data.  The  results  show  that  it  is  neither  the 
artificiality  of  the  speech  synthesizer  nor  the  particular  vowel  involved  that 
enables  listeners  to  treat  the  present  acoustic  continua  in  the  same  fashion 
as  continua  of  proper  speech  sounds.  It  is  this  comparability  which  suggests 
the  interpretation  that  a speech/nonspeech  detector  may  be  responsible  here,  a 
detector  just  like  those  that  putatively  underlie  phonetic  adaptation. 
However,  an  auditory  feature  account  is  ruled  out  by  the  argument  presented 
earlier,  leaving  either  (1)  a phonetic-type  detector  or  (2)  a detectorless 
explanation  as  the  readily  apparent  alternatives  to  consider. 

A phonetic  detector  explanation  here  would  require  the  extension  of  the 
detector  inventory,  since  a speech/nonspeech  feature  is  not  found  in  linguis- 
tic analysis.  The  distinction,  in  fact,  is  not  even  truly  linguistic,  in  the 
sense  of  distinctive  feature  theory,  but  it  certainly  is  a feature  of  human 
perceptual  sensitivity,  and  on  that  basis  might  be  seen  as  potentially 
detector-mediated.  However,  the  existence  of  perceptual  sensitivity  should 
not  be  the  only  criterion  for  postulating  a detector.  The  very  advantage  of 
this  style  of  pattern  recognition  is  that  it  makes  infinite  use  of  finite 
means;  if  a new  detector  is  to  be  added  to  the  set  at  every  new  discovery, 
then  the  contradiction  of  an  indefinitely  expandable  finite  means  reduces  the 
attractiveness  of  the  model.  The  device  required  by  these  data  can  preserve 
its  economy  only  if  it  has  a small  group  of  detectors,  tuned  to  speech,  set  in 
opposition  to  a small  group  tuned  to  nonspeech.  On  this  account,  the  search 
for  independent  confirmation  of  this  organizational  plan  does  not  find  the 
neurophysiology  encouraging.  Although  there  have  been  discussions  of  single- 
cell mediation  of  all  perception,  along  the  lines  of  innate  taxa  (Stent, 
1975),  as  well  as  descriptions  of  arrays  of  phonetic  single  units  (Miller, 
1975),  no  proposal  has  yet  been  made  to  oppose  speech  neurons  and  nonspeech 
neurons.  In  fact,  some  claims  for  uniqueness  of  the  speech  neurology  imply 
that  the  speech  processor,  whatever  it  may  be,  is  separate  from  the  nonspeech 
processor  (Milner,  1962).  Speech,  in  this  view,  is  a mode,  like  vision  or 
audition,  and,  by  analogy,  interacts  with  other  modes  but  is  independent  of 
them.  In  short,  a vast  opponent  process  system  for  speech/nonspeech  is  not  to 
be  endorsed  on  the  basis  of  any  current  view,  and  it  may  be  presumed,  in 
addition,  that  such  a system  is  unlikely  to  exist  given  what  is  already  known. 

Finally,  the  only  direct  evidence  for  feature  detectors  in  speech,  as 
opposed  to  the  invitation  to  such  a conceptualization  offered  by  neurophysio- 
logical metaphor,  is  the  selective  adaptation  work.  Boundary  shifts  occasi- 
oned by  adaptation  are  precisely  the  effects  that  would  permit  the  perceptual 
correlates  of  phonetic  feature  manipulations  to  be  recast  as  the  products  of 
hypothetical  detectors.  However,  though  the  hypothesis  is  reasonable  when  the 
endpoints  differ  by  a single  feature,  it  is  difficult  to  imagine  that  a vowel 
and  a buzz  are  also  distinguished  by  but  a single  feature,  speech/nonspeech. 
The  adaptation  technique,  the  only  test  for  feature  detectors,  is,  ironically, 
not  a demonstration  of  feature  detectors  at  all.  It  simply  reveals  that 
certain  perceptual  contrasts,  in  this  particular  case  of  higher  order  proper- 
ties, undergo  selective  alteration  following  saturation. 

This  study  of  vowel-buzz  adaptation  suggests  that  because  the  hypothetical 
detectors  are  incapable  of  handling  the  result,  and  because  the  detectors 
required  to  handle  it  are  implausible,  selective  adaptation  does  not  depend  on 
the  existence  of  feature  detectors.  If  the  basis  for  adaptation,  and  perhaps 
speech  perception  as  well,  can  be  understood  as  sensitivity  to  the  higher- 
order  values  inherent  in  acoustic  pressure  fluctuations,  without  decomposition 
into  features,  then  the  description  of  such  a process,  not  mere  verification 
of  analytic  features,  is  the  goal  toward  which  further  research  might  well 
proceed. 